UCSC Genomics Institute Computing Infrastructure Information wikiGIdb http://giwiki.gi.ucsc.edu/index.php?title=Genomics_Institute_Computing_Information MediaWiki 1.40.0 Main Page 0 1 1 2018-04-23T19:40:47Z MediaWiki default 0 wikitext text/x-wiki <strong>MediaWiki has been successfully installed.</strong> Consult the [//meta.wikimedia.org/wiki/Help:Contents User's Guide] for information on using the wiki software. == Getting started == * [//www.mediawiki.org/wiki/Special:MyLanguage/Manual:Configuration_settings Configuration settings list] * [//www.mediawiki.org/wiki/Special:MyLanguage/Manual:FAQ MediaWiki FAQ] * [https://lists.wikimedia.org/mailman/listinfo/mediawiki-announce MediaWiki release mailing list] * [//www.mediawiki.org/wiki/Special:MyLanguage/Localisation#Translation_resources Localise MediaWiki for your language] 8e0aa2f2a7829587801db67d0424d9b447e09867 11 1 2018-04-27T23:11:54Z Haifang 1 wikitext text/x-wiki [[Genomic_Institute_Computing_Information_wiki]] * [//www.mediawiki.org/wiki/Special:MyLanguage/Manual:Configuration_settings Configuration settings list] * [//www.mediawiki.org/wiki/Special:MyLanguage/Manual:FAQ MediaWiki FAQ] * [https://lists.wikimedia.org/mailman/listinfo/mediawiki-announce MediaWiki release mailing list] * [//www.mediawiki.org/wiki/Special:MyLanguage/Localisation#Translation_resources Localise MediaWiki for your language] 154946a8336c76296a08d2cf5d0f30226615fbdf 12 11 2018-04-27T23:12:18Z Haifang 1 wikitext text/x-wiki [[Genomic_Institute_Computing_Information_wiki]] 432813a447cc161af82ebd790e1a76692120a191 15 12 2018-04-30T18:29:13Z Haifang 1 wikitext text/x-wiki [[Genomic Institute Computing Information wiki]] 276d76d7dcc04b644bab16efa57556966a19492a 50 
15 2018-07-02T18:53:22Z Weiler 3 wikitext text/x-wiki Welcome to the UC Santa Cruz Genomics Information Wiki! Below are dashboards for various Information Repositories related to the Genomics Institute. [[Genomics Institute Computing Information]] dfd44c5753dc4cd68c97eaa85f7e98a253dc5724 51 50 2018-07-02T18:53:37Z Weiler 3 wikitext text/x-wiki Welcome to the UC Santa Cruz Genomics Information Wiki! Below are dashboards for various Information Repositories related to the Genomics Institute. [[Genomics Institute Computing Information]] 75dcd806d48821b5f8988fa559719f39ef65bbb5 GenomicInstitute 0 3 3 2018-04-23T21:27:31Z Haifang 1 Created page with "Genomic Institute General Information Repository ==Computing Resources and Support== *[[Obtain VPN access]] *[[....]]" wikitext text/x-wiki Genomic Institute General Information Repository ==Computing Resources and Support== *[[Obtain VPN access]] *[[....]] ac2b6faff595f401c52626bb825b97bf4d094796 Obtain VPN access 0 4 4 2018-04-23T21:29:52Z Haifang 1 Created page with "If you need VPN access to POD or CIRM, please make an appointment with the SysAdmin team by emailing ''cluster-admin@soe.ucsc.edu''. In this email please provide your name, yo..." wikitext text/x-wiki If you need VPN access to POD or CIRM, please make an appointment with the SysAdmin team by emailing ''cluster-admin@soe.ucsc.edu''. In this email please provide your name, your PI's name, PI's approval for this access (an email from your PI will be fine) and what other access you need, such as a Unix server account or access to OpenStack. Before your appointment please make sure you have the following: A laptop running OS X, Windows or Ubuntu and connected to '''eduroam'''. cruznet cannot connect to the VPNs. For Macs, please download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the Stable version. For Windows, please download and install '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. 
Select openvpn-install-2.4.5-I601.exe For Ubuntu, please install network-manager-openvpn by typing: sudo apt-get install network-manager-openvpn If that fails, please also install the following: sudo apt-get install network-manager-openvpn-gnome Please do NOT worry about how to configure the software at this point. We will help you set it up at your appointment. d185e06444de930e3bd4901081dd3e560a7ba23d 5 4 2018-04-23T21:36:23Z Haifang 1 wikitext text/x-wiki If you need VPN access to POD or CIRM, please make an appointment with the SysAdmin team by emailing ''cluster-admin@soe.ucsc.edu''. In this email please provide the following: your name your PI's name your PI's approval for this access (an email from your PI will be fine) what other access you need, such as a Unix server account or access to OpenStack Before your appointment please make sure you have the following: A laptop running OS X, Windows or Ubuntu and connected to '''eduroam'''. cruznet cannot connect to the VPNs. You can find the instructions on setting up '''eduroam''' at https://its.ucsc.edu/wireless/eduroam-config.html For Macs, please download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the Stable version. For Windows, please download and install '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. 
b535703527751cb7c327fef26f84fcca24671e93 MediaWiki:Sidebar 8 5 6 2018-04-27T21:41:11Z Haifang 1 Created page with " * navigation ** mainpage|mainpage-description ** recentchanges-url|recentchanges ** randompage-url|randompage ** helppage|help ** Genomic Institute * SEARCH * TOOLBOX * LANGU..." wikitext text/x-wiki * navigation ** mainpage|mainpage-description ** recentchanges-url|recentchanges ** randompage-url|randompage ** helppage|help ** Genomic Institute * SEARCH * TOOLBOX * LANGUAGES 635f8e7f1f46c202d9ee3e2e3e6161a0e1b64f16 7 6 2018-04-27T21:42:43Z Haifang 1 wikitext text/x-wiki [[Genomic Institute Knowledge wiki]] cfffa6f072bdd898be7cee5fca0e0a41eb5e2b30 8 7 2018-04-27T21:50:32Z Haifang 1 wikitext text/x-wiki [[Genomic Institute Computing Information wiki]] 27e16e4f74f17c77f896880ec3778a0bc74946ce 10 8 2018-04-27T23:05:39Z Haifang 1 wikitext text/x-wiki * navigation ** mainpage|mainpage ** portal-url|portal ** currentevents-url|currentevents ** recentchanges-url|recentchanges ** randompage-url|randompage ** helppage|help ** sitesupport-url|sitesupport ** gi|Genomic Institute fca722762bb92c069fe71ce61b54c57060984b83 14 10 2018-04-30T18:27:02Z Haifang 1 wikitext text/x-wiki * navigation ** mainpage|mainpage ** portal-url|portal ** currentevents-url|currentevents ** recentchanges-url|recentchanges ** randompage-url|randompage ** helppage|help ** sitesupport-url|sitesupport ** Genomic Institute|Genomic Institute 3f8ed063ff692e97435bb6996c6d35067e571b35 Genomics Institute Computing Information 0 6 9 2018-04-27T22:19:30Z Haifang 1 Created page with "Genomic Institute Computing Information Repository ==Datacenter Migration== *[[data migration using AWS S3/Glacier tutorial]] *[[data storage resources]] *[[... ...]] ==VPN..." wikitext text/x-wiki Genomic Institute Computing Information Repository ==Datacenter Migration== *[[data migration using AWS S3/Glacier tutorial]] *[[data storage resources]] *[[... 
...]] ==VPN Access== *[[Requirement for users to get GI VPN access]] *[[... ...]] 001416944db7b57d95462824587bdba97deadc34 22 9 2018-05-24T17:26:59Z Jgarcia 2 /* Datacenter Migration */ wikitext text/x-wiki Genomic Institute Computing Information Repository ==Datacenter Migration== *[[data migration using AWS S3/Glacier tutorial]] *[[data storage resources]] *[[... ...]] ==VPN Access== *[[Requirement for users to get GI VPN access]] *[[... ...]] 6d6e0542554908f16614a1ce475bd1b41cb1aba3 23 22 2018-05-25T21:54:56Z Haifang 1 wikitext text/x-wiki Genomic Institute Computing Information Repository ==Datacenter Migration== *[[Public Genomics Institute Infrastructure Ready for Migration]] *[[data storage resources]] *[[... ...]] ==VPN Access== *[[Requirement for users to get GI VPN access]] *[[... ...]] c763656b224ee09e24c3e49d8fecfa5c0fd9f819 28 23 2018-06-15T18:55:26Z Haifang 1 wikitext text/x-wiki Genomic Institute Computing Information Repository ==Datacenter Migration== *[[Public Genomics Institute Infrastructure Ready for Migration]] *[[data storage resources]] *[[... ...]] ==GI Public Computing Environment== *[[How to access the public servers]] ==VPN Access== *[[Requirement for users to get GI VPN access]] *[[... ...]] bc743611ecf940dbe2237992ba32b7ff80a55917 45 28 2018-07-02T18:47:06Z Weiler 3 Weiler moved page [[Genomic Institute Computing Information wiki]] to [[Genomics Institute Computing Information]] wikitext text/x-wiki Genomic Institute Computing Information Repository ==Datacenter Migration== *[[Public Genomics Institute Infrastructure Ready for Migration]] *[[data storage resources]] *[[... ...]] ==GI Public Computing Environment== *[[How to access the public servers]] ==VPN Access== *[[Requirement for users to get GI VPN access]] *[[... 
...]] bc743611ecf940dbe2237992ba32b7ff80a55917 Gi 0 7 13 2018-04-27T23:13:21Z Haifang 1 Created page with "[[Genomic_Institute_Computing_Information_wiki]]" wikitext text/x-wiki [[Genomic_Institute_Computing_Information_wiki]] 432813a447cc161af82ebd790e1a76692120a191 UC Santa Cruz Genomics Institute 0 8 16 2018-04-30T18:30:01Z Haifang 1 Created page with "[[Genomic Institute Computing Information wiki]]" wikitext text/x-wiki [[Genomic Institute Computing Information wiki]] 276d76d7dcc04b644bab16efa57556966a19492a 47 16 2018-07-02T18:48:39Z Weiler 3 wikitext text/x-wiki Welcome to the UC Santa Cruz Genomics Information Wiki! Below are dashboards for various Information Repositories related to the Genomics Institute. [[Genomics Institute Computing Information]] dfd44c5753dc4cd68c97eaa85f7e98a253dc5724 48 47 2018-07-02T18:49:17Z Weiler 3 Weiler moved page [[Genomic Institute]] to [[UC Santa Cruz Genomics Institute]] wikitext text/x-wiki Welcome to the UC Santa Cruz Genomics Information Wiki! Below are dashboards for various Information Repositories related to the Genomics Institute. [[Genomics Institute Computing Information]] dfd44c5753dc4cd68c97eaa85f7e98a253dc5724 Requirement for users to get GI VPN access 0 9 17 2018-04-30T18:37:01Z Haifang 1 Created page with "==OpenVPN Client requirement from users== If you need VPN access to POD or CIRM, please make an appointment with the SysAdmin team by emailing ''cluster-admin@soe.ucsc.edu''...." wikitext text/x-wiki ==OpenVPN Client requirement from users== If you need VPN access to POD or CIRM, please make an appointment with the SysAdmin team by emailing ''cluster-admin@soe.ucsc.edu''. In this email please provide your name, your PI's name, PI's approval for this access (an email from your PI will be fine) and what other access you need, such as a Unix server account or access to OpenStack. 
Before your appointment please make sure you have the following: A laptop running OS X, Windows or Ubuntu and connected to '''eduroam'''. cruznet cannot connect to the VPNs. For Macs, please download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the Stable version. For Windows, please download and install '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. Select openvpn-install-2.4.5-I601.exe For Ubuntu, please install network-manager-openvpn by typing: sudo apt-get install network-manager-openvpn If that fails, please also install the following: sudo apt-get install network-manager-openvpn-gnome Please do NOT worry about how to configure the software at this point. We will help you set it up at your appointment. ca815c4bcc1e7c267ff78cdeb841174d580c7aa0 18 17 2018-04-30T18:37:42Z Haifang 1 /* OpenVPN Client requirement from users */ wikitext text/x-wiki If you need VPN access to POD or CIRM, please make an appointment with the SysAdmin team by emailing ''cluster-admin@soe.ucsc.edu''. In this email please provide your name, your PI's name, PI's approval for this access (an email from your PI will be fine) and what other access you need, such as a Unix server account or access to OpenStack. Before your appointment please make sure you have the following: A laptop running OS X, Windows or Ubuntu and connected to '''eduroam'''. cruznet cannot connect to the VPNs. For Macs, please download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the Stable version. For Windows, please download and install '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. 
Select openvpn-install-2.4.5-I601.exe For Ubuntu, please install network-manager-openvpn by typing: sudo apt-get install network-manager-openvpn If that fails, please also install the following: sudo apt-get install network-manager-openvpn-gnome Please do NOT worry about how to configure the software at this point. We will help you set it up at your appointment. f1af2ba66e072903d991da845521d3874acca625 19 18 2018-04-30T18:39:51Z Haifang 1 wikitext text/x-wiki If you need VPN access to POD or CIRM, please make an appointment with the SysAdmin team by emailing ''cluster-admin@soe.ucsc.edu''. In this email please provide your name, your PI's name, PI's approval for this access (an email from your PI will be fine) and what other access you need, such as a Unix server account or access to OpenStack. Before your appointment please make sure you have the following: A laptop running OS X, Windows or Ubuntu and connected to '''eduroam'''. cruznet cannot connect to the VPNs. You can find the instructions on how to set up '''eduroam''' at https://its.ucsc.edu/wireless/eduroam-config.html. For Macs, please download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the Stable version. For Windows, please download and install '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. Select openvpn-install-2.4.5-I601.exe For Ubuntu, please install network-manager-openvpn by typing: sudo apt-get install network-manager-openvpn If that fails, please also install the following: sudo apt-get install network-manager-openvpn-gnome Please do NOT worry about how to configure the software at this point. We will help you set it up at your appointment. 93bc09a050c78c6781b5f05221bc897b0f08584b 20 19 2018-04-30T18:51:56Z Haifang 1 wikitext text/x-wiki If you need VPN access to POD or CIRM, please make an appointment with the SysAdmin team by emailing ''cluster-admin@soe.ucsc.edu''. 
In this email please provide your name, your PI's name, PI's approval for this access (an email from your PI will be fine) and what other access you need, such as a Unix server account or access to OpenStack. Before your appointment please make sure you have the following: A laptop running OS X, Windows or Ubuntu and connected to '''eduroam'''. cruznet cannot connect to the VPNs. You can find the instructions on how to set up '''eduroam''' at https://its.ucsc.edu/wireless/eduroam-config.html. For Macs, please download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the Stable version. For Windows, please download and install '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. Select ''openvpn-install-x.x.x-xxxx.exe'' For Ubuntu, please install network-manager-openvpn by typing: sudo apt-get install network-manager-openvpn If that fails, please also install the following: sudo apt-get install network-manager-openvpn-gnome Please do NOT worry about how to configure the software at this point. We will help you set it up at your appointment. 5fd1c6d7d36836a69c5a8f1100897010d7e84b11 21 20 2018-05-02T22:50:09Z Haifang 1 wikitext text/x-wiki If you need VPN access to POD or CIRM, please make an appointment with the SysAdmin team by emailing ''cluster-admin@soe.ucsc.edu''. In this email please provide your name your PI's name PI's approval for this access (an email from your PI will be fine) what other access you need, such as a Unix server account or access to OpenStack. Before your appointment, please make sure you have the following: A laptop running OS X, Windows or Ubuntu with a wireless connection to '''eduroam'''. cruznet cannot connect to the VPNs. You can find the instructions on how to set up '''eduroam''' at https://its.ucsc.edu/wireless/eduroam-config.html. For Macs, please download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the Stable version. 
For Windows, please download and install '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. Select ''openvpn-install-x.x.x-xxxx.exe'' For Ubuntu, please install network-manager-openvpn by typing: sudo apt-get install network-manager-openvpn If that fails, please also install the following: sudo apt-get install network-manager-openvpn-gnome Please do NOT worry about how to configure the software at this point. We will help you set it up at your appointment. 12f8e19fcff07726c154a0b5e12f481c1e1fb3d3 35 21 2018-07-02T18:06:36Z Weiler 3 wikitext text/x-wiki If you need VPN access to the Genomics Institute firewalled/secure area (aka the "Prism" Environment), please make an appointment with the GI SysAdmin team by emailing ''cluster-admin@soe.ucsc.edu'' and Rochelle Fuller (hrfuller@ucsc.edu) requesting access. There are several requirements for gaining access to the firewalled area - please complete all these requirements '''BEFORE''' coming to have the VPN software set up for your laptop. 1: You are required to copy cluster-admin@soe.ucsc.edu on an email from your PI or supervisor requesting a VPN account for you - this email should include: Your name Your PI's name PI's approval for this access What other access you need, such as a UNIX server account or access to OpenStack. 2: You must take the NIH Public Security Refresher Course online, then print out the Completion Certificate (which should have your name on it) at the end of the training and deliver it when you come to your appointment to install the VPN software: https://irtsectraining.nih.gov/publicUser.aspx Click on the "2018 Information Security, Counterintelligence, Privacy Awareness, Records Management Refresher" link to begin the course. At the end you will be able to print out the completion certificate that should have your name on it. 
3: You need to print and sign the Genomics Institute VPN User Agreement and bring it with you to your VPN software installation appointment: (link here) 4: Read and sign the last page of the NIH Data Use Agreement, located here for download: PDF DOWNLOAD 5: You will need access to the "eduroam" wireless network '''prior''' to your appointment. Other UCSC wireless networks such as "cruznet" will not work with our VPN software, so please make sure your laptop works with eduroam before coming to your appointment. Instructions on how to get on eduroam are detailed here: https://its.ucsc.edu/wireless/eduroam.html When using the VPN software off-campus, it will usually work unless the wireless network you are on has restrictions preventing it from functioning. Some other universities have such restrictions (notably UCSF), but most other wireless networks and home networks should work fine. 6: Before your appointment, please make sure you install the appropriate OpenVPN software on your laptop: A laptop running OS X, Windows or Ubuntu For Macs, please download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the Latest Stable version. For Windows, please download and install '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. Select ''openvpn-install-x.x.x-xxxx.exe'' For Ubuntu, please install network-manager-openvpn by typing: sudo apt-get install network-manager-openvpn network-manager-openvpn-gnome Please do NOT worry about how to configure the software at this point. We will help you set it up at your appointment. We will correspond with you via email about when the appointment will be. The appointment can take up to 30 minutes per person depending on whether or not any issues come up during the software setup. 
076417749c02a55001d53944283885444408299b 41 40 2018-07-02T18:35:07Z Weiler 3 wikitext text/x-wiki If you need VPN access to the Genomics Institute firewalled/secure area (aka the "Prism" Environment), please make an appointment with the GI SysAdmin team by emailing ''cluster-admin@soe.ucsc.edu'' and Rochelle Fuller (hrfuller@ucsc.edu) requesting access. There are several requirements for gaining access to the firewalled area - please complete all these requirements '''BEFORE''' coming to have the VPN software set up for your laptop. 
1: You are required to copy cluster-admin@soe.ucsc.edu on an email from your PI or supervisor requesting a VPN account for you - this email should include: Your name Your PI's name PI's approval for this access What other access do you need such as a UNIX server account or access to OpenStack. 2: You must take the NIH Public Security Refresher Course online, then print out the Completion Certificate (which should have your name on it) at the end of the training and deliver it when you come to your appointment to install the VPN software: https://irtsectraining.nih.gov/publicUser.aspx Click on the "2018 Information Security, Counterintelligence, Privacy Awareness, Records Management Refresher" link to begin the course. At the end you will be able to print out the completion certificate that should have your name on it. 3: You need to print and sign the Genomics Institute VPN User Agreement and bring it with you to your VPN software installation appointment, located here for download: [[Media:GI_VPN_Policy.pdf]] 4: Read and sign the last page of the NIH Genomic Data Sharing Policy agreement, located here for download: PDF DOWNLOAD 5: You will need access to the "eduroam" wireless network '''prior''' to your appointment. Other UCSC wireless networks such as "cruznet" will not work with our VPN software, so please make sure your laptop works with eduroam before coming to your appointment. Instructions on how to get on eduroam are detailed here: https://its.ucsc.edu/wireless/eduroam.html When using the VPN software off-campus, it will usually work unless the wireless network you are on has restrictions preventing it from functioning. Some other universities have such restrictions (notably UCSF), but most other wireless network and home wireless networks should work fine. 
6: Before your appointment, please make sure you install the appropriate OpenVPN software on your laptop: A laptop running OS X, Windows or Ubuntu For Macs, please download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the Latest Stable version. For Windows, please download and install '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. Select ''openvpn-install-x.x.x-xxxx.exe'' For Ubuntu, please install network-manager-openvpn by typing: sudo apt-get install network-manager-openvpn network-manager-openvpn-gnome Please do NOT worry about how to configure the software at this point. We will help you to set it up at your appointment. We will correspond with you via email on when the appointment will be. The appointment can take up to 30 minutes per person depending on whether or not any issues come up during the software setup. 06740b57436352cc96bbd072210b884fe676739f 43 41 2018-07-02T18:41:41Z Weiler 3 wikitext text/x-wiki If you need VPN access to the Genomics Institute firewalled/secure area (aka the "Prism" Environment), please make an appointment with the GI SysAdmin team by emailing ''cluster-admin@soe.ucsc.edu'' and Rochelle Fuller (hrfuller@ucsc.edu) requesting access. There are several requirements to gaining access to the firewalled area - please complete all these requirements '''BEFORE''' coming to have the VPN software set up for your laptop. 1: You are required to copy cluster-admin@soe.ucsc.edu on an email from your PI or supervisor requesting a VPN account for you - this email should include: Your name Your PI's name PI's approval for this access What other access do you need such as a UNIX server account or access to OpenStack. 
2: You must take the NIH Public Security Refresher Course online, then print out the Completion Certificate (which should have your name on it) at the end of the training and deliver it when you come to your appointment to install the VPN software: https://irtsectraining.nih.gov/publicUser.aspx Click on the "2018 Information Security, Counterintelligence, Privacy Awareness, Records Management Refresher" link to begin the course. At the end you will be able to print out the completion certificate that should have your name on it. 3: You need to print and sign the Genomics Institute VPN User Agreement and bring it with you to your VPN software installation appointment, located here for download: [[Media:GI_VPN_Policy.pdf]] 4: Please print, read and sign the last page of the NIH Genomic Data Sharing Policy agreement, located here for download. Just staple the pages together. Please bring the signed document to your appointment. By signing the document you agree that you have read and understand the policies described therein and that you agree to abide by those policies: [[Media:NIH_GDS_Policy.pdf]] 5: You will need access to the "eduroam" wireless network '''prior''' to your appointment. Other UCSC wireless networks such as "cruznet" will not work with our VPN software, so please make sure your laptop works with eduroam before coming to your appointment. Instructions on how to get on eduroam are detailed here: https://its.ucsc.edu/wireless/eduroam.html When using the VPN software off-campus, it will usually work unless the wireless network you are on has restrictions preventing it from functioning. Some other universities have such restrictions (notably UCSF), but most other wireless network and home wireless networks should work fine. 
6: Before your appointment, please make sure you install the appropriate OpenVPN software on your laptop: A laptop running OS X, Windows or Ubuntu For Macs, please download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the Latest Stable version. For Windows, please download and install '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. Select ''openvpn-install-x.x.x-xxxx.exe'' For Ubuntu, please install network-manager-openvpn by typing: sudo apt-get install network-manager-openvpn network-manager-openvpn-gnome Please do NOT worry about how to configure the software at this point. We will help you to set it up at your appointment. We will correspond with you via email on when the appointment will be. The appointment can take up to 30 minutes per person depending on whether or not any issues come up during the software setup. If you show up for your appointment without one (or more) of the above outlined requirements, we will have to reschedule your appointment for a time when you can arrive after completing the above requirements. dbaf1fd8ebefe74ee1028791303844a27fb779a1 44 43 2018-07-02T18:44:09Z Weiler 3 wikitext text/x-wiki If you need VPN access to the Genomics Institute firewalled/secure area (aka the "Prism" Environment), please make an appointment with the GI SysAdmin team by emailing ''cluster-admin@soe.ucsc.edu'' requesting access. There are several requirements to gaining access to the firewalled area - please complete all these requirements '''BEFORE''' coming to have the VPN software set up for your laptop. 1: You are required to ask your PI or sponsor to email cluster-admin@soe.ucsc.edu requesting a VPN account for you - this email should include: Your name Your PI's name Your requested username (If your name is Jane Doe, then your username could be 'jdoe' for example). PI's approval for this access What other access you need such as a UNIX server account or access to OpenStack. 
2: You must take the NIH Public Security Refresher Course online, then print out the Completion Certificate (which should have your name on it) at the end of the training and deliver it when you come to your appointment to install the VPN software: https://irtsectraining.nih.gov/publicUser.aspx Click on the "2018 Information Security, Counterintelligence, Privacy Awareness, Records Management Refresher" link to begin the course. At the end you will be able to print out the completion certificate that should have your name on it. 3: You need to print and sign the Genomics Institute VPN User Agreement and bring it with you to your VPN software installation appointment, located here for download: [[Media:GI_VPN_Policy.pdf]] 4: Please print, read and sign the last page of the NIH Genomic Data Sharing Policy agreement, located here for download. Just staple the pages together. Please bring the signed document to your appointment. By signing the document you agree that you have read and understand the policies described therein and that you agree to abide by those policies: [[Media:NIH_GDS_Policy.pdf]] 5: You will need access to the "eduroam" wireless network '''prior''' to your appointment. Other UCSC wireless networks such as "cruznet" will not work with our VPN software, so please make sure your laptop works with eduroam before coming to your appointment. Instructions on how to get on eduroam are detailed here: https://its.ucsc.edu/wireless/eduroam.html When using the VPN software off-campus, it will usually work unless the wireless network you are on has restrictions preventing it from functioning. Some other universities have such restrictions (notably UCSF), but most other wireless network and home wireless networks should work fine. 
6: Before your appointment, please install the appropriate OpenVPN software on your laptop (running OS X, Windows, or Ubuntu):
* For Macs, download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the latest stable version.
* For Windows, download and install the '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. Select ''openvpn-install-x.x.x-xxxx.exe''
* For Ubuntu, install network-manager-openvpn by typing:
 sudo apt-get install network-manager-openvpn network-manager-openvpn-gnome

Please do NOT worry about configuring the software at this point; we will help you set it up at your appointment. We will correspond with you via email about scheduling. The appointment can take up to 30 minutes per person, depending on whether any issues come up during the software setup. If you show up for your appointment without one or more of the requirements outlined above, we will have to reschedule for a time after you have completed them.

=Public Genomics Institute Infrastructure Ready for Migration=

The new Genomics Institute public infrastructure is ready to begin account creation and data migration. It consists of the shared compute server '''courtyard.gi.ucsc.edu''' attached to a home directory and file storage server. We will keep the SDSC public '''kolossus''' server and its attached storage available '''until June 29th''' to allow time for migrating data. We will be adding additional compute after this migration period. We are working on the private infrastructure (i.e. the replacement for pod) and anticipate having it available for migration in the next few weeks with a similar migration window; we will email with updates.

==Getting An Account==
Starting Tuesday May 29th, contact the SysAdmin at ''cluster-admin@soe.ucsc.edu'' to set up an account on the new system. Each account includes a backed-up home directory initially limited to '''30GB'''.

==Large Data Storage==
If you require more space, provide the admin with the name of a PI or funded project. We will create a shared directory under /public/groups/<PI or project name> that you and others associated with the same PI or project will have access to. If you work with multiple PIs or projects, you will have access to each of their shared directories. The lab/project may organize data under that directory in any structure, but the total size of a PI/project top-level directory under groups will generally be limited to '''10TB''' during migration while we get a better idea of how much data all groups require. This storage is on reliable RAID6 hardware, but it is not backed up and should primarily be used for data that you are actively working with. For backup and long-term archival we suggest setting up an AWS account and using Glacier.

==Migrating Data==
Once you have an account you can migrate data from the old to the new infrastructure via ''rsync''. For large shared storage, please coordinate with others in your lab and/or project. For help with this, contact ''cluster-admin@soe.ucsc.edu''
=How to access the public servers=

The Genomics Institute public server is courtyard.gi.ucsc.edu. To access the server, first request an account by emailing cluster-admin@soe.ucsc.edu. If you already have an account on kolossus, please provide your full name, your username on kolossus, and the name of your PI and the lab or project you are working with. If you are affiliated with the Genomics Institute but do not have an account on kolossus, please make an appointment with the SysAdmin to set up your account in person.

You can also request to be added to a group. A group for a PI is named ''PI's_lastname_lab'', or after a project, such as ''treehouse''.

After you get the account, you can log in by typing:
 ssh ''your_username''@courtyard.gi.ucsc.edu

Your home directory path is /public/home/''your_username''; its quota is 30GB. A group directory's path is /public/groups/''name_of_the_group''; you can create your own directory under it. You share that disk space with your group mates, and the quota for your group is 10TB.

You can use '''''rsync''''' to copy files from kolossus.sdsc.edu to courtyard.gi.ucsc.edu. If needed, you can read about '''''rsync''''' [https://linux.die.net/man/1/rsync here].
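Because home directories are capped at 30GB, it can be useful to check usage before copying data in. A minimal sketch follows; the 30GB figure comes from the text above, and the snippet measures a temporary directory as a stand-in so it runs anywhere. On courtyard you would point ''du'' at /public/home/''your_username'' instead.

```shell
# Report usage of a directory against the 30GB home quota.
# A temporary directory stands in for /public/home/<username>.
quota_kb=$((30 * 1024 * 1024))   # 30GB expressed in KB
dir=$(mktemp -d)
dd if=/dev/zero of="$dir/blob" bs=1024 count=512 2>/dev/null  # 512KB test file
used_kb=$(du -sk "$dir" | cut -f1)   # -s: summary total, -k: report in KB
echo "used ${used_kb}KB of ${quota_kb}KB"
if [ "$used_kb" -gt "$quota_kb" ]; then
  echo "over quota"
else
  echo "within quota"
fi
```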
=Main Page=

Welcome to the UC Santa Cruz Genomics Institute Information Wiki! Below are dashboards for various Information Repositories related to the Genomics Institute.
[[Genomics Institute Computing Information]]

=Genomics Institute Computing Information=

Welcome to the Genomics Institute Computing Information Repository! Browse the topics below for help in the area you are curious about.

==Datacenter Migration==
*[[Public Genomics Institute Infrastructure Ready for Migration]]
==GI Public Computing Environment==
*[[How to access the public servers]]
==GI Firewalled Computing Environment (PRISM)==
*[[Access to the Firewalled Compute Servers]]
==VPN Access==
*[[Requirement for users to get GI VPN access]]
==NIH dbGaP Access Requirements==
*[[Requirements for dbGaP Access]]
==Datacenter Migration==
*[[Public Genomics Institute Infrastructure Ready for Migration]]

==GI Public Computing Environment==
*[[How to access the public servers]]

==GI Firewalled Computing Environment (PRISM)==
*[[Access to the Firewalled Compute Servers]]

==VPN Access==
*[[Requirement for users to get GI VPN access]]

==NIH dbGaP Access Requirements==
*[[Requirements for dbGaP Access]]

==giCloud Openstack==
*[[Overview of '''giCloud''' in the Genomics Institute]]

=Requirement for users to get GI VPN access=

If you need VPN access to the Genomics Institute firewalled/secure area (also known as the "Prism" environment), please make an appointment with the GI SysAdmin team by emailing ''cluster-admin@soe.ucsc.edu'' to request access.

There are several requirements for gaining access to the firewalled area. Please complete all of them '''BEFORE''' coming to have the VPN software set up on your laptop. Use this checklist to make sure you have completed all '''six''' requirements, each explained in detail below:

'''1'''. User info, your PI's info, and your PI's approval
'''2'''. NIH Public Security Refresher Course certificate
'''3'''. Signed Genomics Institute VPN User Agreement
'''4'''. Signed NIH Genomic Data Sharing Policy agreement
'''5'''. "eduroam" wireless network set up on your laptop
'''6'''. The appropriate OpenVPN software installed on your laptop

'''1''': Ask your PI or sponsor to email cluster-admin@soe.ucsc.edu requesting a VPN account for you. This email should include:
* Your name
* Your PI's name
* Your requested username (if your name is Jane Doe, your username could be 'jdoe', for example)
* Your PI's approval for this access
* Any other access you need, such as a UNIX server account or access to OpenStack

'''2''': Take the NIH Public Security Refresher Course online, then print the Completion Certificate (which should have your name on it) and bring it to your VPN software installation appointment: https://irtsectraining.nih.gov/publicUser.aspx Click the "2018 Information Security, Counterintelligence, Privacy Awareness, Records Management Refresher" link to begin the course. At the end you will be able to print out the completion certificate.
'''3''': Print and sign the Genomics Institute VPN User Agreement and bring it with you to your VPN software installation appointment. It is available for download here: [[Media:GI_VPN_Policy.pdf]]

'''4''': Print, read, and sign the last page of the NIH Genomic Data Sharing Policy agreement, available for download here: [[Media:NIH_GDS_Policy.pdf]] Just staple the pages together and bring the signed document to your appointment. By signing the document you agree that you have read and understood the policies described therein and that you will abide by them.

'''5''': You will need access to the "eduroam" wireless network '''prior''' to your appointment. Other UCSC wireless networks such as "cruznet" will not work with our VPN software, so please make sure your laptop works with eduroam before your appointment. Instructions for getting on eduroam are here: https://its.ucsc.edu/wireless/eduroam.html

When using the VPN software off campus, it will usually work unless the wireless network you are on has restrictions that prevent it from functioning. Some other universities have such restrictions (notably UCSF), but most other wireless networks, including home networks, should work fine.

'''6''': Before your appointment, install the appropriate OpenVPN software on a laptop running OS X, Windows, or Ubuntu:
* '''For Macs''', download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the latest Stable version.
* '''For Windows''', download and install the '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. Select ''openvpn-install-x.x.x-xxxx.exe''
* '''For Ubuntu''', install network-manager-openvpn by typing: sudo apt-get install network-manager-openvpn network-manager-openvpn-gnome

Please do NOT worry about how to configure the software at this point. We will help you set it up at your appointment.
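Ubuntu users can sanity-check the step 6 install before the appointment. The sketch below is an illustrative check, not an official GI script; it assumes only the two package names the instructions above ask you to install and the standard Debian/Ubuntu `dpkg` tool.

```shell
# Illustrative pre-appointment check for the Ubuntu install in step 6.
# The package names are the ones requested above; this script itself is
# only a suggestion, not part of the official GI setup.
for pkg in network-manager-openvpn network-manager-openvpn-gnome; do
  if dpkg -s "$pkg" >/dev/null 2>&1; then
    echo "$pkg: installed"
  else
    echo "$pkg: missing - run: sudo apt-get install $pkg"
  fi
done
```

If either package reports as missing, rerun the `apt-get` command from step 6 before your appointment.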
We will correspond with you via email about when the appointment will be. The appointment can take up to 30 minutes per person, depending on whether any issues come up during the software setup. If you show up for your appointment without one or more of the requirements outlined above, we will have to reschedule it for a time after you have completed them.

'''PLEASE NOTE:''' Because of the overhead involved in setting up VPN access, please only request access if you have an immediate need to work on data that exists behind the firewall. We have had a fair number of people request access and go through the setup but then never use it. In other words, please do not request access because "one day you might need it", but because you '''do''' actually need it!

'''ALSO NOTE:''' VPN accounts expire one year from the date you first gain access. To renew for another year, your PI/sponsor will need to send us a note asking for renewal.
3: You need to print and sign the Genomics Institute VPN User Agreement and bring it with you to your VPN software installation appointment, located here for download: [[Media:GI_VPN_Policy.pdf]] 4: Please print, read and sign the last page of the NIH Genomic Data Sharing Policy agreement, located here for download. Just staple the pages together. Please bring the signed document to your appointment. By signing the document you agree that you have read and understand the policies described therein and that you agree to abide by those policies: [[Media:NIH_GDS_Policy.pdf]] 5: You will need access to the "eduroam" wireless network '''prior''' to your appointment. Other UCSC wireless networks such as "cruznet" will not work with our VPN software, so please make sure your laptop works with eduroam before coming to your appointment. Instructions on how to get on eduroam are detailed here: https://its.ucsc.edu/wireless/eduroam.html When using the VPN software off-campus, it will usually work unless the wireless network you are on has restrictions preventing it from functioning. Some other universities have such restrictions (notably UCSF), but most other wireless network and home wireless networks should work fine. 6: Before your appointment, please make sure you install the appropriate OpenVPN software on your laptop: A laptop running OS X, Windows or Ubuntu For Macs, please download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the Latest Stable version. For Windows, please download and install '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. Select ''openvpn-install-x.x.x-xxxx.exe'' For Ubuntu, please install network-manager-openvpn by typing: sudo apt-get install network-manager-openvpn network-manager-openvpn-gnome Please do NOT worry about how to configure the software at this point. We will help you to set it up at your appointment. 
We will correspond with you via email on when the appointment will be. The appointment can take up to 30 minutes per person depending on whether or not any issues come up during the software setup. If you show up for your appointment without one (or more) of the above outlined requirements, we will have to reschedule your appointment for a time when you can arrive after completing the above requirements. '''PLEASE NOTE:''' Because of the overhead required in setting up VPN access, please only request access if you have an immediate need to work on data that exists behind the firewall. We have had a decent number of people request access and go through the setup but then never use it. In other words, please do not request access because "one day you might need it", but because you '''do''' actually need it! 465a8dae859bc24cb409ff729697445938a9aa65 57 56 2018-07-02T22:59:23Z Weiler 3 wikitext text/x-wiki If you need VPN access to the Genomics Institute firewalled/secure area (aka the "Prism" Environment), please make an appointment with the GI SysAdmin team by emailing ''cluster-admin@soe.ucsc.edu'' requesting access. There are several requirements to gaining access to the firewalled area - please complete all these requirements '''BEFORE''' coming to have the VPN software set up for your laptop. 1: You are required to ask your PI or sponsor to email cluster-admin@soe.ucsc.edu requesting a VPN account for you - this email should include: Your name Your PI's name Your requested username (If your name is Jane Doe, then your username could be 'jdoe' for example). PI's approval for this access What other access you need such as a UNIX server account or access to OpenStack. 
2: You must take the NIH Public Security Refresher Course online, then print out the Completion Certificate (which should have your name on it) at the end of the training and deliver it when you come to your appointment to install the VPN software: https://irtsectraining.nih.gov/publicUser.aspx Click on the "2018 Information Security, Counterintelligence, Privacy Awareness, Records Management Refresher" link to begin the course. At the end you will be able to print out the completion certificate that should have your name on it. 3: You need to print and sign the Genomics Institute VPN User Agreement and bring it with you to your VPN software installation appointment, located here for download: [[Media:GI_VPN_Policy.pdf]] 4: Please print, read and sign the last page of the NIH Genomic Data Sharing Policy agreement, located here for download. Just staple the pages together. Please bring the signed document to your appointment. By signing the document you agree that you have read and understand the policies described therein and that you agree to abide by those policies: [[Media:NIH_GDS_Policy.pdf]] 5: You will need access to the "eduroam" wireless network '''prior''' to your appointment. Other UCSC wireless networks such as "cruznet" will not work with our VPN software, so please make sure your laptop works with eduroam before coming to your appointment. Instructions on how to get on eduroam are detailed here: https://its.ucsc.edu/wireless/eduroam.html When using the VPN software off-campus, it will usually work unless the wireless network you are on has restrictions preventing it from functioning. Some other universities have such restrictions (notably UCSF), but most other wireless network and home wireless networks should work fine. 
6: Before your appointment, please make sure you install the appropriate OpenVPN software on your laptop: A laptop running OS X, Windows or Ubuntu For Macs, please download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the Latest Stable version. For Windows, please download and install '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. Select ''openvpn-install-x.x.x-xxxx.exe'' For Ubuntu, please install network-manager-openvpn by typing: sudo apt-get install network-manager-openvpn network-manager-openvpn-gnome Please do NOT worry about how to configure the software at this point. We will help you to set it up at your appointment. We will correspond with you via email on when the appointment will be. The appointment can take up to 30 minutes per person depending on whether or not any issues come up during the software setup. If you show up for your appointment without one (or more) of the above outlined requirements, we will have to reschedule your appointment for a time when you can arrive after completing the above requirements. '''PLEASE NOTE:''' Because of the overhead required in setting up VPN access, please only request access if you have an immediate need to work on data that exists behind the firewall. We have had a decent number of people request access and go through the setup but then never use it. In other words, please do not request access because "one day you might need it", but because you '''do''' actually need it! 6f45a8b6eb5306dcf5d56d244fdc15b9b3fb7448 60 57 2018-07-12T20:20:58Z Haifang 1 wikitext text/x-wiki If you need VPN access to POD, please make an appointment with the SysAdmin team by emailing ''cluster-admin@soe.ucsc.edu''. In this email please provide your name your PI's name PI's approval for this access (an email from your PI will be fine) what other access do you need such as an unix server account or to the OpenStack. 
Before your appointment, please make sure you have the following: A laptop running OS X, Windows or Ubuntu wireless connection to '''eduroam'''. cruznet cannot connect to the VPNs. You can find the instruction on how to setup '''eduroam''' at https://its.ucsc.edu/wireless/eduroam-config.html. '''For Macs''', please download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the Stable version. '''For Windows''', please download and install '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. Select ''openvpn-install-x.x.x-xxxx.exe'' '''For Ubuntu''', please install network-manager-openvpn by typing: sudo apt-get install network-manager-openvpn if that fails, please also install the following: sudo apt-get install network-manager-openvpn-gnome Please do NOT worry about how to configure the software at this point. We will help you to set it up at your appointment. 2dc214accc84efed700fed576489a053e3ae9202 62 60 2018-07-12T20:25:49Z Haifang 1 Reverted edits by [[Special:Contributions/Haifang|Haifang]] ([[User talk:Haifang|talk]]) to last revision by [[User:Weiler|Weiler]] wikitext text/x-wiki If you need VPN access to the Genomics Institute firewalled/secure area (aka the "Prism" Environment), please make an appointment with the GI SysAdmin team by emailing ''cluster-admin@soe.ucsc.edu'' requesting access. There are several requirements to gaining access to the firewalled area - please complete all these requirements '''BEFORE''' coming to have the VPN software set up for your laptop. 1: You are required to ask your PI or sponsor to email cluster-admin@soe.ucsc.edu requesting a VPN account for you - this email should include: Your name Your PI's name Your requested username (If your name is Jane Doe, then your username could be 'jdoe' for example). PI's approval for this access What other access you need such as a UNIX server account or access to OpenStack. 
2: You must take the NIH Public Security Refresher Course online, then print out the Completion Certificate (which should have your name on it) at the end of the training and deliver it when you come to your appointment to install the VPN software: https://irtsectraining.nih.gov/publicUser.aspx Click on the "2018 Information Security, Counterintelligence, Privacy Awareness, Records Management Refresher" link to begin the course. At the end you will be able to print out the completion certificate that should have your name on it. 3: You need to print and sign the Genomics Institute VPN User Agreement and bring it with you to your VPN software installation appointment, located here for download: [[Media:GI_VPN_Policy.pdf]] 4: Please print, read and sign the last page of the NIH Genomic Data Sharing Policy agreement, located here for download. Just staple the pages together. Please bring the signed document to your appointment. By signing the document you agree that you have read and understand the policies described therein and that you agree to abide by those policies: [[Media:NIH_GDS_Policy.pdf]] 5: You will need access to the "eduroam" wireless network '''prior''' to your appointment. Other UCSC wireless networks such as "cruznet" will not work with our VPN software, so please make sure your laptop works with eduroam before coming to your appointment. Instructions on how to get on eduroam are detailed here: https://its.ucsc.edu/wireless/eduroam.html When using the VPN software off-campus, it will usually work unless the wireless network you are on has restrictions preventing it from functioning. Some other universities have such restrictions (notably UCSF), but most other wireless network and home wireless networks should work fine. 
6: Before your appointment, please make sure you install the appropriate OpenVPN software on your laptop: A laptop running OS X, Windows or Ubuntu For Macs, please download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the Latest Stable version. For Windows, please download and install '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. Select ''openvpn-install-x.x.x-xxxx.exe'' For Ubuntu, please install network-manager-openvpn by typing: sudo apt-get install network-manager-openvpn network-manager-openvpn-gnome Please do NOT worry about how to configure the software at this point. We will help you to set it up at your appointment. We will correspond with you via email on when the appointment will be. The appointment can take up to 30 minutes per person depending on whether or not any issues come up during the software setup. If you show up for your appointment without one (or more) of the above outlined requirements, we will have to reschedule your appointment for a time when you can arrive after completing the above requirements. '''PLEASE NOTE:''' Because of the overhead required in setting up VPN access, please only request access if you have an immediate need to work on data that exists behind the firewall. We have had a decent number of people request access and go through the setup but then never use it. In other words, please do not request access because "one day you might need it", but because you '''do''' actually need it! 6f45a8b6eb5306dcf5d56d244fdc15b9b3fb7448 78 62 2018-07-16T19:20:30Z Weiler 3 wikitext text/x-wiki If you need VPN access to the Genomics Institute firewalled/secure area (aka the "Prism" Environment), please make an appointment with the GI SysAdmin team by emailing ''cluster-admin@soe.ucsc.edu'' requesting access. 
There are several requirements to gaining access to the firewalled area - please complete all these requirements '''BEFORE''' coming to have the VPN software set up for your laptop. 1: You are required to ask your PI or sponsor to email cluster-admin@soe.ucsc.edu requesting a VPN account for you - this email should include: Your name Your PI's name Your requested username (If your name is Jane Doe, then your username could be 'jdoe' for example). PI's approval for this access What other access you need such as a UNIX server account or access to OpenStack. 2: You must take the NIH Public Security Refresher Course online, then print out the Completion Certificate (which should have your name on it) at the end of the training and deliver it when you come to your appointment to install the VPN software: https://irtsectraining.nih.gov/publicUser.aspx Click on the "2018 Information Security, Counterintelligence, Privacy Awareness, Records Management Refresher" link to begin the course. At the end you will be able to print out the completion certificate that should have your name on it. 3: You need to print and sign the Genomics Institute VPN User Agreement and bring it with you to your VPN software installation appointment, located here for download: [[Media:GI_VPN_Policy.pdf]] 4: Please print, read and sign the last page of the NIH Genomic Data Sharing Policy agreement, located here for download. Just staple the pages together. Please bring the signed document to your appointment. By signing the document you agree that you have read and understand the policies described therein and that you agree to abide by those policies: [[Media:NIH_GDS_Policy.pdf]] 5: You will need access to the "eduroam" wireless network '''prior''' to your appointment. Other UCSC wireless networks such as "cruznet" will not work with our VPN software, so please make sure your laptop works with eduroam before coming to your appointment. 
Instructions on how to get on eduroam are detailed here: https://its.ucsc.edu/wireless/eduroam.html When using the VPN software off-campus, it will usually work unless the wireless network you are on has restrictions preventing it from functioning. Some other universities have such restrictions (notably UCSF), but most other wireless network and home wireless networks should work fine. 6: Before your appointment, please make sure you install the appropriate OpenVPN software on your laptop: A laptop running OS X, Windows or Ubuntu For Macs, please download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the Latest Stable version. For Windows, please download and install '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. Select ''openvpn-install-x.x.x-xxxx.exe'' For Ubuntu, please install network-manager-openvpn by typing: sudo apt-get install network-manager-openvpn network-manager-openvpn-gnome Please do NOT worry about how to configure the software at this point. We will help you to set it up at your appointment. We will correspond with you via email on when the appointment will be. The appointment can take up to 30 minutes per person depending on whether or not any issues come up during the software setup. If you show up for your appointment without one (or more) of the above outlined requirements, we will have to reschedule your appointment for a time when you can arrive after completing the above requirements. '''PLEASE NOTE:''' Because of the overhead required in setting up VPN access, please only request access if you have an immediate need to work on data that exists behind the firewall. We have had a decent number of people request access and go through the setup but then never use it. In other words, please do not request access because "one day you might need it", but because you '''do''' actually need it! 
'''ALSO NOTE:''' VPN accounts expire after one year from the date of first gaining access. To renew for another year you will need your PI/sponsor to send us a note asking for renewal. c12039b2efb795068bfb3f2eccc64f7b224edae4 79 78 2018-07-16T20:34:39Z Haifang 1 wikitext text/x-wiki If you need VPN access to the Genomics Institute firewalled/secure area (aka the "Prism" Environment), please make an appointment with the GI SysAdmin team by emailing ''cluster-admin@soe.ucsc.edu'' requesting access. There are several requirements to gaining access to the firewalled area - please complete all these requirements '''BEFORE''' coming to have the VPN software set up for your laptop. Please use this checklist to make sure that you have complete all '''six''' requirements explained in details below. '''1'''. User info, your PI info and your PI's approval '''2'''. NIH Public Security Refresher Course Certificate '''3'''. Signed Genomics Institute VPN User Agreement '''4'''. Singed NIH Genomic Data Sharing Policy Agreement '''5'''. "eduroam" wireless network has setup on your laptop '''6'''. Install the appropriate OpenVPN software on your laptop '''1''': You are required to ask your PI or sponsor to email cluster-admin@soe.ucsc.edu requesting a VPN account for you - this email should include: Your name Your PI's name Your requested username (If your name is Jane Doe, then your username could be 'jdoe' for example). PI's approval for this access What other access you need such as a UNIX server account or access to OpenStack. '''2''': You must take the NIH Public Security Refresher Course online, then print out the Completion Certificate (which should have your name on it) at the end of the training and deliver it when you come to your appointment to install the VPN software: https://irtsectraining.nih.gov/publicUser.aspx Click on the "2018 Information Security, Counterintelligence, Privacy Awareness, Records Management Refresher" link to begin the course. 
At the end you will be able to print out the completion certificate that should have your name on it. '''3''': You need to print and sign the Genomics Institute VPN User Agreement and bring it with you to your VPN software installation appointment, located here for download: [[Media:GI_VPN_Policy.pdf]] '''4''': Please print, read and sign the last page of the NIH Genomic Data Sharing Policy agreement, located here for download. Just staple the pages together. Please bring the signed document to your appointment. By signing the document you agree that you have read and understand the policies described therein and that you agree to abide by those policies: [[Media:NIH_GDS_Policy.pdf]] '''5''': You will need access to the "eduroam" wireless network '''prior''' to your appointment. Other UCSC wireless networks such as "cruznet" will not work with our VPN software, so please make sure your laptop works with eduroam before coming to your appointment. Instructions on how to get on eduroam are detailed here: https://its.ucsc.edu/wireless/eduroam.html When using the VPN software off-campus, it will usually work unless the wireless network you are on has restrictions preventing it from functioning. Some other universities have such restrictions (notably UCSF), but most other wireless network and home wireless networks should work fine. '''6''': Before your appointment, please make sure you install the appropriate OpenVPN software on your laptop: A laptop running OS X, Windows or Ubuntu For Macs, please download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the Latest Stable version. For Windows, please download and install '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. 
If you need VPN access to the Genomics Institute firewalled/secure area (aka the "Prism" environment), please make an appointment with the GI SysAdmin team by emailing ''cluster-admin@soe.ucsc.edu'' requesting access. There are several requirements for gaining access to the firewalled area - please complete all of them '''BEFORE''' coming to have the VPN software set up on your laptop. Use this checklist to make sure that you have completed all '''six''' requirements:

'''1'''. User info, your PI info and your PI's approval
'''2'''. NIH Public Security Refresher Course certificate
'''3'''. Signed Genomics Institute VPN User Agreement
'''4'''. Signed NIH Genomic Data Sharing Policy agreement
'''5'''. "eduroam" wireless network set up on your laptop
'''6'''. The appropriate OpenVPN software installed on your laptop

'''1''': Ask your PI or sponsor to email cluster-admin@soe.ucsc.edu requesting a VPN account for you. This email should include:
* Your name
* Your PI's name
* Your requested username (if your name is Jane Doe, then your username could be 'jdoe', for example)
* The PI's approval for this access
* Any other access you need, such as a UNIX server account or access to OpenStack

'''2''': You must take the NIH Public Security Refresher Course online, then print out the completion certificate (which should have your name on it) at the end of the training and bring it to your appointment to install the VPN software. You must complete the course in a single continuous sitting in order to be able to print the certificate at the end: https://irtsectraining.nih.gov/publicUser.aspx Click on the "2018 Information Security, Counterintelligence, Privacy Awareness, Records Management Refresher" link to begin the course.

'''3''': Print and sign the Genomics Institute VPN User Agreement, available for download here, and bring it with you to your VPN software installation appointment: [[Media:GI_VPN_Policy.pdf]]

'''4''': Print, read and sign the last page of the NIH Genomic Data Sharing Policy agreement, available for download here. Just staple the pages together and bring the signed document to your appointment. By signing the document you agree that you have read and understood the policies described therein and that you will abide by them: [[Media:NIH_GDS_Policy.pdf]]

'''5''': You will need access to the "eduroam" wireless network '''prior''' to your appointment. Other UCSC wireless networks such as "cruznet" will not work with our VPN software, so please make sure your laptop works with eduroam before coming to your appointment. Instructions for getting on eduroam are here: https://its.ucsc.edu/wireless/eduroam.html When using the VPN software off-campus, it will usually work unless the wireless network you are on has restrictions that prevent it from functioning. Some other universities have such restrictions (notably UCSF), but most other wireless networks, including home networks, should work fine.

'''6''': Before your appointment, please install the appropriate OpenVPN software on a laptop running OS X, Windows or Ubuntu:
* For Macs, download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the latest stable version.
* For Windows, download and install the '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. Select ''openvpn-install-x.x.x-xxxx.exe''
* For Ubuntu, install network-manager-openvpn by typing:
 sudo apt-get install network-manager-openvpn network-manager-openvpn-gnome
Please do NOT worry about how to configure the software at this point; we will help you set it up at your appointment.

We will correspond with you via email about when the appointment will be. The appointment can take up to 30 minutes per person, depending on whether any issues come up during the software setup. If you show up for your appointment without one (or more) of the requirements outlined above, we will have to reschedule it for a time when you can arrive with them completed.

'''PLEASE NOTE:''' Because of the overhead required in setting up VPN access, please only request access if you have an immediate need to work on data that exists behind the firewall. We have had a fair number of people request access and go through the setup but then never use it.
In other words, please do not request access because "one day you might need it", but because you '''do''' actually need it!

'''ALSO NOTE:''' VPN accounts expire one year from the date of first gaining access. To renew for another year, you will need your PI/sponsor to send us a note asking for renewal.
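Before your appointment, you can do a rough self-check of requirement 6 on Ubuntu. This is only a sketch: the command names (openvpn, nmcli) are assumptions based on the packages named above, and on a Mac you would look for Tunnelblick in /Applications instead.

```shell
# Rough pre-appointment check for requirement 6 (the Ubuntu case).
# The command names are assumptions based on the packages named above;
# on a Mac, look for Tunnelblick in /Applications instead.
for tool in openvpn nmcli; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found:   $tool"
  else
    echo "missing: $tool"
  fi
done
```

If anything prints as missing, rerun the install step for your platform before coming in.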
== How to Gain Access to the Public Genomics Institute Compute Servers ==

If you need access to the Genomics Institute compute servers, please ask your PI or sponsor to email cluster-admin@soe.ucsc.edu requesting that you be granted access. We can then set up a quick meeting to create your account and go over the details.

== Server Types and Management ==

You can log into our public compute server via SSH:

'''courtyard.gi.ucsc.edu''': 1TB RAM, 64 cores, 672GB local scratch space

This server runs CentOS 7.5 Linux and is managed by the Genomics Institute Cluster Admin group. If you need software installed on it, please make your request by emailing cluster-admin@soe.ucsc.edu.

== Storage ==

These servers mount two types of storage: home directories and group storage directories. Your home directory is located at /public/home/''username'' and has a 30GB quota. The group storage directories are created per PI, and each group directory has a 15TB quota. For example, if David Haussler is the PI that you report to directly, then the directory would exist as /public/groups/hausslerlab.
Request access to that group directory and you will then be able to write to it. Each group directory is shared by the lab it belongs to, so be mindful of everyone's data usage and share the 15TB available per group accordingly.

== Actually Doing Work and Computing ==

When doing research, running jobs and the like, please be careful about your resource consumption on the server you are on. Don't run so many threads or processes at once that you exhaust the available RAM or disk I/O. If you are not sure of your potential RAM, CPU or disk impact, start small with one or two processes and work your way up from there. Also, before running your jobs, check what is already happening on the server by using the 'top' command to see who and what else is running and what resources are already being consumed. If, after starting a process, you realize that the server has slowed down considerably or become unusable, kill your processes and re-evaluate what you need to make things work. These servers are shared resources - be a good neighbor!

== Serving Files to the Public via the Web ==

If you want to set up a web page on courtyard, or serve files over HTTP from there, do this:

 mkdir /public/home/''your_username''/public_html
 chmod 755 /public/home/''your_username''/public_html

Put data in the public_html directory. The URL will be: http://public.gi.ucsc.edu/''~username''/
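The public_html steps can be sketched end to end. This is only an illustration: it uses a temporary directory as a stand-in for /public/home/''your_username'', since the real path exists only on courtyard.

```shell
# Sketch of the public_html setup, using a temp dir as a stand-in
# for /public/home/<your_username> (the real path exists only on courtyard).
home="$(mktemp -d)"

mkdir "$home/public_html"
chmod 755 "$home/public_html"   # world-readable/traversable so the web server can serve it

echo '<h1>hello</h1>' > "$home/public_html/index.html"

# Verify the permissions the web server needs:
stat -c '%a' "$home/public_html"   # prints: 755
```

Note that 755 grants read and traverse access to everyone with an account on the server as well as to the web server, so only put data there that you intend to be public.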
== Serving Files to the Public via the Web ==

If you want to set up a web page on courtyard, or serve files over HTTP from there, do this:

 mkdir /public/home/''your_username''/public_html
 chmod 755 /public/home/''your_username''/public_html

Put data in the public_html directory. The URL will be: http://public.gi.ucsc.edu/''~username''/

== /scratch Space on the Servers ==

Each server generally has a local /scratch filesystem that you can use to store temporary files. '''BE ADVISED''' that /scratch is not backed up, and the data there could disappear in the event of a disk failure or other problem. Do not store important data there; if it is important, move it somewhere else very soon after creation.
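A common pattern for scratch space is to create a private per-job directory, work there, copy anything worth keeping back to backed-up storage promptly, and clean up. A sketch (the paths and job name are illustrative; on the servers you would use /scratch and your group directory - /tmp stands in here so the sketch runs anywhere):

```shell
# Create a private per-job directory under the scratch area
# (on the GI servers, use /scratch; /tmp is a portable stand-in)
scratch_root=${SCRATCH_ROOT:-/tmp}
workdir=$(mktemp -d "$scratch_root/myjob.XXXXXX")

# ... run your job, writing temporary output into "$workdir" ...
echo "intermediate results" > "$workdir/results.txt"

# Copy anything important to backed-up storage promptly
# (stand-in destination; on the servers, your group directory)
keep_dir=$(mktemp -d "$scratch_root/keep.XXXXXX")
cp "$workdir/results.txt" "$keep_dir/"

# Clean up your scratch directory so others can use the space
rm -rf "$workdir"
```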
= Requirement for users to get POD VPN access =

If you need VPN access to POD, please make an appointment with the SysAdmin team by emailing cluster-admin@soe.ucsc.edu. In this email, please provide:

* your name
* your PI's name
* your PI's approval for this access (an email from your PI is fine)
* any other access you need, such as a Unix server account or access to OpenStack

Before your appointment, please make sure you have the following:

* a laptop running OS X, Windows, or Ubuntu
* a wireless connection to '''eduroam''' (cruznet cannot connect to the VPNs); instructions for setting up eduroam are at https://its.ucsc.edu/wireless/eduroam-config.html

'''For Macs''', please download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the Stable version.

'''For Windows''', please download and install the '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. Select ''openvpn-install-x.x.x-xxxx.exe''.

'''For Ubuntu''', please install network-manager-openvpn by typing:

 sudo apt-get install network-manager-openvpn

If that fails, please also install the following:

 sudo apt-get install network-manager-openvpn-gnome

Please do NOT worry about how to configure the software at this point. We will help you set it up at your appointment.

= Access to the Firewalled Compute Servers =

Before you can access the firewalled environment (Prism), you must get VPN access to it, which is detailed here: [[Requirement for users to get GI VPN access]]
== /scratch Space on the Servers == Each server will generally have a local /scratch filesystem that you can use to store temporary files. '''BE ADVISED''' that /scratch is not backed up, and the data there could disappear in the event of a disk failure or anything else. Do not store important data there. If it is important, it should be moved somewhere else very soon after creation. b56b69c466a8963da725963a0fa995f5835553c2 95 90 2018-08-09T15:59:32Z Weiler 3 wikitext text/x-wiki Before you can access the firewalled environment (Prism), you must get VPN access to it, which is detailed here: [[Requirement for users to get GI VPN access]] == Server Types and Management== After confirming your VPN software is working, you can ssh into one of the compute servers behind the VPN: '''crimson.prism''': 256GB RAM, 32 cores, 5.5TB local scratch space '''razzmatazz.prism''': 256GB RAM, 32 cores, 5.5TB local scratch space '''mustard.prism''': 1.5TB RAM, 160 cores, 9TB local scratch space These servers are running CentOS 7.5 Linux. They are managed by the Genomics Institute Cluster Admin group. If you need software installed on either or both of these servers, please make your request by emailing cluster-admin@soe.ucsc.edu. We will add another compute server later on that will have 1TB RAM, 64 cores and several TB of local scratch, but not for a while. == Storage == These servers mount two types of storage; home directories and group storage directories. Your home directory will be located as "/private/home/username" and has a 30GB quota. The group storage directories are created per PI, and each group directory has a 15TB quota. For example, if David Haussler is the PI that you report to directly, then the directory would exist as /private/groups/hausslerlab. Request access to that group directory and you will then be able to write to it. 
Each of those group directories are shared by the lab it belongs to, so you must be wary of everyone's data usage and share the 15TB available per group accordingly. == Actually Doing Work and Computing == When doing research, running jobs and the like, please be careful of your resource consumption on the server you are on. Don't run too many threads or cores at once if such a thing overruns the RAM available or the disk IO available. If you are not sure of your potential RAM, CPU or disk impact, start small with one or two processes and work your way up from there. Also, before running your stuff, check what else is already happening on the server by using the 'top' command to see who else and what else is running and what kind of resources are already being consumed. If, after starting a process, you realize that the server slows down considerably or becomes unusable, kill your processes and re-evaluate what you need to make things work. These servers are shared resources - be a good neighbor! == The Firewall == All servers are behind a firewall in this environment, and as such, you must connect to the VPN in order to access them. They will not be accessible from the greater Internet without VPN. Although you will be able to connect outbound from them to other servers on the internet to copy data in, sync git repos, stuff like that. It is only inbound connections that will be blocked. All machines behind the firewall have the private domain name suffix of "*.prism". == /scratch Space on the Servers == Each server will generally have a local /scratch filesystem that you can use to store temporary files. '''BE ADVISED''' that /scratch is not backed up, and the data there could disappear in the event of a disk failure or anything else. Do not store important data there. If it is important, it should be moved somewhere else very soon after creation. 
Before you can access the firewalled environment (Prism), you must get VPN access to it, which is detailed here: [[Requirement for users to get GI VPN access]]

== Server Types and Management ==

After confirming your VPN software is working, you can ssh into one of the compute servers behind the VPN:

'''crimson.prism''': 256GB RAM, 32 cores, 5.5TB local scratch space

'''razzmatazz.prism''': 256GB RAM, 32 cores, 5.5TB local scratch space

'''mustard.prism''': 1.5TB RAM, 160 cores, 9TB local scratch space

These servers run CentOS 7.5 Linux and are managed by the Genomics Institute Cluster Admin group. If you need software installed on any of these servers, please make your request by emailing cluster-admin@soe.ucsc.edu.

== Storage ==

These servers mount two types of storage: home directories and group storage directories. Your home directory is located at /private/home/username and has a 30GB quota. Group storage directories are created per PI, and each has a 15TB quota. For example, if David Haussler is the PI you report to directly, the directory would be /private/groups/hausslerlab. Request access to that group directory and you will then be able to write to it. Each group directory is shared by the lab it belongs to, so be mindful of everyone's data usage and share the 15TB available per group accordingly.

== Actually Doing Work and Computing ==

When doing research, running jobs, and the like, please be careful about your resource consumption on the server you are on. Don't run so many threads or processes at once that they exhaust the available RAM or disk I/O. If you are not sure of your potential RAM, CPU, or disk impact, start small with one or two processes and work your way up from there.
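To see how much of the 30GB home or 15TB group quota you are consuming, a small stdlib-only sketch can help. This is illustrative only: the group path below is the example directory from above, and on quota'd network mounts the df-style numbers often (but not always) reflect the quota rather than the raw disk size.

```python
import os
import shutil

GIB = 1024 ** 3

def fs_usage(path: str) -> str:
    """df-style view of the filesystem behind `path`."""
    total, used, free = shutil.disk_usage(path)
    return f"{path}: {used / GIB:.1f}/{total / GIB:.1f} GiB used, {free / GIB:.1f} GiB free"

def tree_size(path: str) -> int:
    """du-style byte total of the files under `path` (your own footprint)."""
    return sum(
        os.path.getsize(os.path.join(root, f))
        for root, _, files in os.walk(path)
        for f in files
        if os.path.exists(os.path.join(root, f))  # skip dangling symlinks
    )

# /private/groups/hausslerlab is the example group directory from the text above.
for p in (os.path.expanduser("~"), "/private/groups/hausslerlab"):
    if os.path.isdir(p):
        print(fs_usage(p), f"- my footprint: {tree_size(p) / GIB:.2f} GiB")
```

`du -sh` and `df -h` on the command line give the same information; the script form is just convenient for alerting yourself before a big job fills the shared 15TB.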
Also, before running your jobs, check what else is already happening on the server by using the 'top' command to see who and what else is running and what resources are already being consumed. If, after starting a process, you find that the server slows down considerably or becomes unusable, kill your processes and re-evaluate what you need to make things work. These servers are shared resources: be a good neighbor!

== The Firewall ==

All servers in this environment are behind a firewall, so you must connect to the VPN in order to access them. They are not accessible from the greater Internet without the VPN. You can still make outbound connections from them to other servers on the Internet to copy data in, sync git repositories, and so on; only inbound connections are blocked. All machines behind the firewall have the private domain name suffix "*.prism".

== /scratch Space on the Servers ==

Each server generally has a local /scratch filesystem that you can use to store temporary files. '''BE ADVISED''' that /scratch is not backed up, and the data there could disappear in the event of a disk failure or anything else. Do not store important data there; if it is important, move it somewhere else very soon after creation.
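The 'top' check above can also be scripted so you can refuse to launch when the machine is busy. A minimal sketch using only the standard library (`os.getloadavg` is Unix-only, which matches these CentOS servers):

```python
import os

# Snapshot current contention before launching your own jobs.
cores = os.cpu_count() or 1
load1, load5, load15 = os.getloadavg()  # runnable processes, averaged over 1/5/15 min
print(f"{cores} cores; load averages: {load1:.2f} (1m), {load5:.2f} (5m), {load15:.2f} (15m)")

if load1 >= cores:
    print("Already saturated - consider another server, or wait.")
else:
    print(f"Roughly {max(cores - load1, 0):.0f} cores' worth of headroom.")
```

Interactive `top` (or `top -b -n 1` for a one-shot batch pass) still gives the per-user, per-process detail this summary lacks.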
= Genomics Institute Computing Information =

Welcome to the Genomics Institute Computing Information Repository! Browse the topics below for help in the area you are curious about.
== Datacenter Migration ==
*[[Public Genomics Institute Infrastructure Ready for Migration]]

== GI Public Computing Environment ==
*[[How to access the public servers]]

== GI Firewalled Computing Environment (PRISM) ==
*[[Access to the Firewalled Compute Servers]]

== VPN Access ==
*[[Requirement for users to get GI VPN access]]

== NIH dbGaP Access Requirements ==
*[[Requirements for dbGaP Access]]

== giCloud Openstack ==
*[[Overview of giCloud in the Genomics Institute]]

== Amazon Web Services Account Management ==
*[[Overview of Getting and Using an AWS IAM Account]]
*[[AWS Account List and Numbers]]

== Kubernetes Information ==
*[[Computational Genomics Kubernetes Installation]]

= Overview of giCloud in the Genomics Institute =

'''giCloud''' is the Genomics Institute implementation of OpenStack. OpenStack is an IaaS (Infrastructure as a Service) platform on which you can launch VM instances in a cloud environment. More about OpenStack can be found here: https://www.openstack.org

Our implementation of OpenStack is located behind the GI VPN service, so you cannot launch VM instances that provide "public" services on the greater Internet. The VM instances you create are meant for testing software or processing data pipelines. The instances are '''not backed up''' and should be treated as such. They have outbound access to the greater Internet, so you can download data and install software from the Internet, but no one originating from the Internet outside of the VPN can see your instances.
The instances you create are also meant for processing secure data, hence being behind the VPN; the disks are also encrypted to provide an additional layer of physical security to satisfy certain requirements of FISMA and HIPAA.

Once you connect to the GI VPN, you can access the web console for giCloud here: http://gicloud.prism

Note that the connection is over http and not https. This is expected and normal: even though your web browser may complain about the connection not being encrypted, it actually '''IS''' encrypted by the VPN software. Your browser simply isn't aware of the VPN connection.

You can get access to the GI VPN service if you are affiliated with the UCSC Genomics Institute. If you need VPN access, please fulfill the requirements as detailed here: http://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access

After you have VPN access, make an appointment with the Genomics Institute Cluster Admin group by emailing cluster-admin@soe.ucsc.edu and we can set you up with a giCloud account.

= Requirements for dbGaP Access =

If you need NIH dbGaP access, there are several requirements to gaining access - please complete all of these requirements '''BEFORE''' requesting dbGaP credentials.

NOTE: If you already have GI VPN access to the GI "Prism" environment, then you have already completed the requirements detailed below - let Rochelle know and we can quickly move to getting you set up.

Please use this checklist to make sure that you have completed all '''three''' requirements:

'''1'''. Your PI's info and your PI's approval

'''2'''. NIH Public Security Refresher Course Certificate

'''3'''.
Signed NIH Genomic Data Sharing Policy Agreement

'''1''': You are required to ask your PI or sponsor to email '''Rochelle Fuller (hrfuller@ucsc.edu)''' requesting dbGaP access for you. This email should include:
* Your name
* Your PI's name
* Your PI's approval for this access

'''2''': You must take the NIH Public Security Refresher Course online, then print out the Completion Certificate (which should have your name on it) at the end of the training and deliver it to Rochelle Fuller. You must complete the course in a single continuous sitting in order to be able to print the certificate at the end: https://irtsectraining.nih.gov/publicUser.aspx

Click on the "2018 Information Security, Counterintelligence, Privacy Awareness, Records Management Refresher" link to begin the course. At the end you will be able to print out the completion certificate, which should have your name on it.

'''3''': Please print, read, and sign the last page of the NIH Genomic Data Sharing Policy agreement, available for download below; just staple the pages together and bring the signed document to your appointment. By signing the document you agree that you have read and understood the policies described therein and that you agree to abide by them: [[Media:NIH_GDS_Policy.pdf]]

We will let you know via email when the appointment will be - please email Rochelle ('''hrfuller@ucsc.edu''') about getting everything set up!
= Overview of Getting and Using an AWS IAM Account =

__TOC__

== Getting Amazon Web Services Access ==

The Genomics Institute has a series of AWS accounts that each support different projects. If you become associated with one or more of those projects, you will often need access to the corresponding account or accounts.
== Switching Roles into Another AWS Account ==

Now that you have configured a password and enabled MFA, you will be allowed to "Switch Role" into another account so that you can begin work there. The first time you switch roles into an account it will ask you a few questions; subsequently it will remember which roles you have access to, and they will become menu items you can click on to quickly switch roles.

Let's assume that you want to switch to the 'pangenomics' AWS account, and that you have already been granted access to do so by the cluster-admin group. After logging into the 'gi-gateway' account at the URL listed here (same as above):

https://gi-gateway.signin.aws.amazon.com/console

Do the following to switch roles into the 'pangenomics' account (as an example):

* Click '''"bill@ucsc.edu @ gi-gateway"''' at the top right of your browser window (again, bill@ucsc.edu is an example).
* Click the '''"Switch Role"''' option in the drop-down menu.
* In the following menu it will ask you about the role you will be assuming. In our example we will use the following:
** Account* = pangenomics
** Role* = developer
** Display Name = [leave blank, or use a short phrase]
** Color = [choose a color for this role]
* Then click the '''"Switch Role"''' button.

If all went well you should be dropped into the 'pangenomics' account, and you should be identified in the top right-hand corner of the page as '''"developer @ pangenomics"''', indicating your role and the account you are active in. You can then work as normal in that account. If you have not yet been given access to that role, you will receive an error message and will not be allowed to switch roles.

'''NOTE:''' When you switch roles, it may drop you into a region that you don't expect. Always verify the region you are in by looking at the top right of the web page - it will display your region there. Most of our resources exist in "Oregon" (us-west-2), but some items live in other regions on a per-case basis.

If you wish to switch context back to the 'gi-gateway' account in order to manage something, or to switch to another role in another account, simply:

* Click '''"developer @ pangenomics"''' in the top right corner of the window.
* Select '''"Back to bill@ucsc.edu"'''

You will then be sent back to the 'gi-gateway' context, where you can add another role to switch into, manage your credentials, and switch roles again.
== API Access and Secret Keys ==

If you require programmatic access to AWS, you will likely be familiar with the AWS concepts of Access Keys and Secret Keys, which scripts can use to authenticate to AWS and call its APIs without going through the web console. In the past, access keys and secret keys could be used with no further authentication. This introduces a security risk, so the keys must be carefully guarded - if anyone gets your keys, they can rack up charges on your AWS account without your knowledge!

Under the "Assume Role" mechanism we now use, Access Keys and Secret Keys can still be created by users, '''but only while logged into the gi-gateway account'''. Do not try to create keys while you have "Switched Roles" into another account. Keys you create in the top level 'gi-gateway' account will work for you in any sub-account you have access to switch roles into. You will, however, need to do a little more configuration for your keys to work from a UNIX command line.
When using the '''"aws"''' command line tool (assuming you have it installed, which is outside the scope of this document), follow the steps outlined in this AWS document to configure it:

https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html
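With keys created in gi-gateway, the CLI can perform the same role switch via named profiles. A minimal sketch of what ~/.aws/credentials and ~/.aws/config might look like - the key values, account IDs, and MFA serial below are placeholders for illustration, not the real values for our accounts:

```ini
# ~/.aws/credentials -- keys created while logged into gi-gateway
[gi-gateway]
aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# ~/.aws/config -- a profile that assumes a role in a sub-account
# (123456789012 and 210987654321 are placeholder account IDs)
[profile pangenomics]
source_profile = gi-gateway
role_arn = arn:aws:iam::123456789012:role/developer
mfa_serial = arn:aws:iam::210987654321:mfa/bill@ucsc.edu
region = us-west-2
```

A command such as <code>aws s3 ls --profile pangenomics</code> would then prompt for your MFA code and assume the role automatically, mirroring the "Switch Role" flow in the console.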
Using the "Assume Role" mechanism we are now using, Access Keys and Secret Keys can still be created by users '''while logged into the gi-gateway account only'''. Do not try to create keys while you have "Switched Roles" into another account. Keys you create in the top level 'gi-gateway' account will work for you in any sub-account you have access to switch roles to. You will need to do a little more configuration for your keys to work from a UNIX command line however. When using the '''"aws"''' command line tool, assuming you have it installed (the process of which is outside the scope of this document), you would use the steps outlined in this document to configure it: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html That document has a lot of other really useful information in it - if you plan on using keys for API access, we highly recommend reading it through. Generically, if you plan on using keys for API Access, minimally you will need to configure the "aws" utility and then tweak the config a bit for our setup. To start, run "aws configure". It should look something like this: $ aws configure AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY Default region name [None]: us-west-2 Default output format [None]: Most folks do that to start. It creates two files: ~/.aws/config ~/.aws/credentials Those two files are important to access AWS via the 'aws' command. '''~/.aws/credentials''' This file contains your access key and secret key, and should not need to be modified after running 'aws configure'. Your same keys can be used to access any roles in any accounts you have access to. '''~/.aws/config''' This file contains some account information you will need to tweak. 
You may want to configure something like this: [default] region = us-west-2 [profile pangenomics-developer] source_profile = default role_arn = arn:aws:iam::422448306679:role/developer mfa_serial = arn:aws:iam::652235167018:mfa/bill@ucsc.edu The 'role-arn' line contains the role and account number you are accessing. You can see a list of live account numbers HERE, find the account number you need and enter it on the role_arn line, as well as the role name. You will get the role name from the cluster-admin group when you get access. The 'mda_serial' line contains the identifier for your MFA device. It will always look like '''"arn:aws:iam::652235167018:mfa/[your_iam_username]"'''. The account number listed there will always be '652235167018' because that is the account number of the top level 'gi-gateway' account. Once that is configured, you should be able to reference the profile you just created when using the aws command, like so: $ aws s3 ls --profile pangenomics-developer It will ask you for your MFA code and then run the command. Once you enter the MFA code, the token it creates will be valid by default for c7a676a40e18f571c5cfeeeb048aea0179c2f16f 115 114 2019-02-06T23:35:48Z Weiler 3 wikitext text/x-wiki __TOC__ == Getting Amazon Web Services Access == The Genomics Institute has a series of AWS Accounts that all support different projects. Often if you become associated with one or more of those projects, you will need access to that account or accounts. The way we are managing AWS IAM Account Access is that we have one AWS account that is the 'top level' account that everyone gets access to, and then, once you log in there, you can "Switch Role" into another sub-account that you are running things in. To get access, you will need your PI or Project Manager to email cluster-admin (cluster-admin@soe.ucsc.edu) asking for an AWS account for you, and also in that email to name the projects you will have access to. 
The cluster-admin group will contact you with your credentials to login. Once you login, you can change your password if you want to and also you will be able to set up MFA (Multi-Factor Authentication) for your account. You will be required to use MFA in order to "Switch Role" into any of the sub-accounts for the projects you are working on. The login URL to use when logging in to the top level account is listed below. The top level account is known as "gi-gateway": https://gi-gateway.signin.aws.amazon.com/console When you login, you will see a couple error messages on the AWS dashboard saying you don't have access to view certain resources - '''this is normal''', so just ignore the error messages. == Configuring Account Credentials == Once you login to the gi-gateway, you will have very few permissions to do anything there - which is normal, since you will not be working in that account anyway. The gi-gateway account is just there to authenticate you to AWS. '''Changing Your Password''' You can change your password by clicking on your username on the top right of the web browser window, just to the right of the little bell. If your username is bill@ucsc.edu, for example: * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window. * Click the '''"My Security Credentials"''' drop-down menu option. * Click the '''"Change Password"''' button to change your password. You will also need to configure MFA on your account before you will be allowed to switch roles into another account. '''Configuring MFA''' To configure MFA, the most common way to do it is to use '''Google Authenticator''', which is an app available for Apple and Android based cell phones and mobile devices. The app is free, simply download it from the app store to your cell phone or tablet to get started. Other MFA apps may also work but we have not tested everything out there. 
Once you have Google Authenticator installed, log into the gi-gateway account using the above URL, then: * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window (again, bill@ucsc.edu is an example). * Click the '''"My Security Credentials"''' drop-down menu option. * Scroll down to the MFA (Multi-Factor Authentication) section of the page, and click '''"Assign MFA Device"'''. * In the following menu select '''"Virtual MFA Device"'''. * In the following window click the '''"Show QR Code"''' link, and the MFA QR barcode will appear on your screen. * Open the Google Authenticator app on your mobile device, and click the little "+" symbol in the top right corner of the app to add an account. * You will then need to select "Scan Barcode" in the Google Authenticaor app to continue, and aim your mobile device camera at the QR barcode. * The new account MFA device should then be set up and you should see a 6 digit number with a small timer to the right of it. You must type in one 6 digit code that it displays into your web browser when asked, then wait for the next code to appear after the timer expires, and type that into the second field. It should then inform you that you have successfully associated an MFA device with your account. Once you have associated an MFA device with your AWS Account, log out, then log back in. It will ask for your username and password, and then ask for your MFA code, which you can view by opening Google Authenticator and seeing what code it is displaying at that time. The code changes every 30 seconds or so. == Switching Roles into Another AWS Account == Now that you have configured a password and enabled MFA, you will be allowed to "Switch Roles" into another account such that you can begin work there. 
The first time you switch roles into an account it will ask you a few questions, but subsequently it will remember which roles you have access to and they will become a menu item you can click on to quickly switch roles. Let's assume that you want to switch to the 'pangenomics' AWS account, and you have been already granted access to do so by the cluster-admin group. After logging into the 'gi-gateway' account at the URL listed here (same as above): https://gi-gateway.signin.aws.amazon.com/console Do the following to switch roles into the 'pangenomics' account (as an example): * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window (again, bill@ucsc.edu is an example). * Click the '''"Switch Role"''' option in the drop-down menu. * In the following menu it will ask you about the role you will be assuming. In our example we will use the following: -Account* = pangenomics -Role* = developer -Display Name = [leave blank, or use a short phrase] -Color = [choose a color for this role] * Then click the "Switch Role" button. If all went well you should be dumped into the 'pangenomics' account, and you should be identified in the top right hand corner of the page as '''"developer @ pangenomics"''', indicating your role and the account you are active in. You can then work as normal in that account. If you have not yet been given access to that role, you will receive an error message and not be allowed to switch roles. '''NOTE:''' When you switch roles, it may dump you into a region that you don't expect it to. Always verify the region you are in by looking at the top right of the web page - it will display your region there. Most of our stuff exists in "Oregon" (us-west-2), but some items appear in other regions on a per-case basis. 
If you wish to switch context back to the 'gi-gateway' account in order to manage something, or to switch to another role in another account, simple: * Click '''"develop @ pangenomics"''' in the top right corner of the window. * Select '''"Back to bill@ucsc.edu"''' You will then be sent back to the 'gi-gateway' context, and you can add another role to switch into, manage your credentials and further switch roles. == API Access and Secret Keys == If you require programmatic access to AWS, you will very likely be familiar with the AWS concept of Access Keys and Secret Keys, which can be used by scripts to authenticate yourself to AWS and use the APIs there without using the web console to authenticate. In the past, access keys and secret keys could be used by users with no further authentication. This introduces a security risk, as the management of those keys must be carefully guarded - if anyone gets your keys, they can rack up charges on your AWS account without your knowledge! Using the "Assume Role" mechanism we are now using, Access Keys and Secret Keys can still be created by users '''while logged into the gi-gateway account only'''. Do not try to create keys while you have "Switched Roles" into another account. Keys you create in the top level 'gi-gateway' account will work for you in any sub-account you have access to switch roles to. You will need to do a little more configuration for your keys to work from a UNIX command line however. When using the '''"aws"''' command line tool, assuming you have it installed (the process of which is outside the scope of this document), you would use the steps outlined in this document to configure it: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html That document has a lot of other really useful information in it - if you plan on using keys for API access, we highly recommend reading it through. 
Generically, if you plan on using keys for API Access, minimally you will need to configure the "aws" utility and then tweak the config a bit for our setup. To start, run "aws configure". It should look something like this: $ aws configure AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY Default region name [None]: us-west-2 Default output format [None]: Most folks do that to start. It creates two files: ~/.aws/config ~/.aws/credentials Those two files are important to access AWS via the 'aws' command. '''~/.aws/credentials''' This file contains your access key and secret key, and should not need to be modified after running 'aws configure'. Your same keys can be used to access any roles in any accounts you have access to. '''~/.aws/config''' This file contains some account information you will need to tweak. You may want to configure something like this: [default] region = us-west-2 [profile pangenomics-developer] source_profile = default role_arn = arn:aws:iam::422448306679:role/developer mfa_serial = arn:aws:iam::652235167018:mfa/bill@ucsc.edu The 'role-arn' line contains the role and account number you are accessing. You can see a list of live account numbers HERE, find the account number you need and enter it on the role_arn line, as well as the role name. You will get the role name from the cluster-admin group when you get access. The 'mda_serial' line contains the identifier for your MFA device. It will always look like '''"arn:aws:iam::652235167018:mfa/[your_iam_username]"'''. The account number listed there will always be '652235167018' because that is the account number of the top level 'gi-gateway' account. Once that is configured, you should be able to reference the profile you just created when using the aws command, like so: $ aws s3 ls --profile pangenomics-developer It will ask you for your MFA code and then run the command. 
Once you enter the MFA code, the token it creates will be valid by default for one hour, so you can run other 'aws' cli commands for one hour without the need to re-authenticate with MFA. After one hour, you will need to authenticate via MFA again. You can extend your session length from one hour to twelve hours but utilizing the AWS Security Token Service (AWS STS). See this page for more information on how to do this: https://docs.aws.amazon.com/cli/latest/reference/sts/assume-role.html#examples The examples at the bottom are particularly useful. e4a7326a7650ca3e3a6eaaa0e90375ee8281db32 119 115 2019-02-07T00:15:08Z Weiler 3 wikitext text/x-wiki __TOC__ == Getting Amazon Web Services Access == The Genomics Institute has a series of AWS Accounts that all support different projects. Often if you become associated with one or more of those projects, you will need access to that account or accounts. The way we are managing AWS IAM Account Access is that we have one AWS account that is the 'top level' account that everyone gets access to, and then, once you log in there, you can "Switch Role" into another sub-account that you are running things in. To get access, you will need your PI or Project Manager to email cluster-admin (cluster-admin@soe.ucsc.edu) asking for an AWS account for you, and also in that email to name the projects you will have access to. The cluster-admin group will contact you with your credentials to login. Once you login, you can change your password if you want to and also you will be able to set up MFA (Multi-Factor Authentication) for your account. You will be required to use MFA in order to "Switch Role" into any of the sub-accounts for the projects you are working on. The login URL to use when logging in to the top level account is listed below. 
The top level account is known as "gi-gateway": https://gi-gateway.signin.aws.amazon.com/console When you login, you will see a couple error messages on the AWS dashboard saying you don't have access to view certain resources - '''this is normal''', so just ignore the error messages. == Configuring Account Credentials == Once you login to the gi-gateway, you will have very few permissions to do anything there - which is normal, since you will not be working in that account anyway. The gi-gateway account is just there to authenticate you to AWS. '''Changing Your Password''' You can change your password by clicking on your username on the top right of the web browser window, just to the right of the little bell. If your username is bill@ucsc.edu, for example: * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window. * Click the '''"My Security Credentials"''' drop-down menu option. * Click the '''"Change Password"''' button to change your password. You will also need to configure MFA on your account before you will be allowed to switch roles into another account. '''Configuring MFA''' To configure MFA, the most common way to do it is to use '''Google Authenticator''', which is an app available for Apple and Android based cell phones and mobile devices. The app is free, simply download it from the app store to your cell phone or tablet to get started. Other MFA apps may also work but we have not tested everything out there. Once you have Google Authenticator installed, log into the gi-gateway account using the above URL, then: * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window (again, bill@ucsc.edu is an example). * Click the '''"My Security Credentials"''' drop-down menu option. * Scroll down to the MFA (Multi-Factor Authentication) section of the page, and click '''"Assign MFA Device"'''. * In the following menu select '''"Virtual MFA Device"'''. 
* In the following window click the '''"Show QR Code"''' link, and the MFA QR barcode will appear on your screen. * Open the Google Authenticator app on your mobile device, and click the little "+" symbol in the top right corner of the app to add an account. * You will then need to select "Scan Barcode" in the Google Authenticaor app to continue, and aim your mobile device camera at the QR barcode. * The new account MFA device should then be set up and you should see a 6 digit number with a small timer to the right of it. You must type in one 6 digit code that it displays into your web browser when asked, then wait for the next code to appear after the timer expires, and type that into the second field. It should then inform you that you have successfully associated an MFA device with your account. Once you have associated an MFA device with your AWS Account, log out, then log back in. It will ask for your username and password, and then ask for your MFA code, which you can view by opening Google Authenticator and seeing what code it is displaying at that time. The code changes every 30 seconds or so. == Switching Roles into Another AWS Account == Now that you have configured a password and enabled MFA, you will be allowed to "Switch Roles" into another account such that you can begin work there. The first time you switch roles into an account it will ask you a few questions, but subsequently it will remember which roles you have access to and they will become a menu item you can click on to quickly switch roles. Let's assume that you want to switch to the 'pangenomics' AWS account, and you have been already granted access to do so by the cluster-admin group. 
After logging into the 'gi-gateway' account at the URL listed here (same as above): https://gi-gateway.signin.aws.amazon.com/console Do the following to switch roles into the 'pangenomics' account (as an example): * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window (again, bill@ucsc.edu is an example). * Click the '''"Switch Role"''' option in the drop-down menu. * In the following menu it will ask you about the role you will be assuming. In our example we will use the following: -Account* = pangenomics -Role* = developer -Display Name = [leave blank, or use a short phrase] -Color = [choose a color for this role] * Then click the "Switch Role" button. If all went well you should be dumped into the 'pangenomics' account, and you should be identified in the top right hand corner of the page as '''"developer @ pangenomics"''', indicating your role and the account you are active in. You can then work as normal in that account. If you have not yet been given access to that role, you will receive an error message and not be allowed to switch roles. '''NOTE:''' When you switch roles, it may dump you into a region that you don't expect it to. Always verify the region you are in by looking at the top right of the web page - it will display your region there. Most of our stuff exists in "Oregon" (us-west-2), but some items appear in other regions on a per-case basis. If you wish to switch context back to the 'gi-gateway' account in order to manage something, or to switch to another role in another account, simple: * Click '''"develop @ pangenomics"''' in the top right corner of the window. * Select '''"Back to bill@ucsc.edu"''' You will then be sent back to the 'gi-gateway' context, and you can add another role to switch into, manage your credentials and further switch roles. 
== API Access and Secret Keys == If you require programmatic access to AWS, you will very likely be familiar with the AWS concept of Access Keys and Secret Keys, which can be used by scripts to authenticate yourself to AWS and use the APIs there without using the web console to authenticate. In the past, access keys and secret keys could be used by users with no further authentication. This introduces a security risk, as the management of those keys must be carefully guarded - if anyone gets your keys, they can rack up charges on your AWS account without your knowledge! Using the "Assume Role" mechanism we are now using, Access Keys and Secret Keys can still be created by users '''while logged into the gi-gateway account only'''. Do not try to create keys while you have "Switched Roles" into another account. Keys you create in the top level 'gi-gateway' account will work for you in any sub-account you have access to switch roles to. You will need to do a little more configuration for your keys to work from a UNIX command line however. When using the '''"aws"''' command line tool, assuming you have it installed (the process of which is outside the scope of this document), you would use the steps outlined in this document to configure it: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html That document has a lot of other really useful information in it - if you plan on using keys for API access, we highly recommend reading it through. Generically, if you plan on using keys for API Access, minimally you will need to configure the "aws" utility and then tweak the config a bit for our setup. To start, run "aws configure". It should look something like this: $ aws configure AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY Default region name [None]: us-west-2 Default output format [None]: Most folks do that to start. 
It creates two files: ~/.aws/config ~/.aws/credentials Those two files are important to access AWS via the 'aws' command. '''~/.aws/credentials''' This file contains your access key and secret key, and should not need to be modified after running 'aws configure'. Your same keys can be used to access any roles in any accounts you have access to. '''~/.aws/config''' This file contains some account information you will need to tweak. You may want to configure something like this: [default] region = us-west-2 [profile pangenomics-developer] source_profile = default role_arn = arn:aws:iam::422448306679:role/developer mfa_serial = arn:aws:iam::652235167018:mfa/bill@ucsc.edu The 'role-arn' line contains the role and account number you are accessing. You can see a list of live account numbers here: [[AWS Account List and Numbers]]. Find the account number you need and enter it on the role_arn line, as well as the role name. You will get the role name from the cluster-admin group when you get access. The 'mda_serial' line contains the identifier for your MFA device. It will always look like '''"arn:aws:iam::652235167018:mfa/[your_iam_username]"'''. The account number listed there will always be '652235167018' because that is the account number of the top level 'gi-gateway' account. Once that is configured, you should be able to reference the profile you just created when using the aws command, like so: $ aws s3 ls --profile pangenomics-developer It will ask you for your MFA code and then run the command. Once you enter the MFA code, the token it creates will be valid by default for one hour, so you can run other 'aws' cli commands for one hour without the need to re-authenticate with MFA. After one hour, you will need to authenticate via MFA again. You can extend your session length from one hour to twelve hours but utilizing the AWS Security Token Service (AWS STS). 
__TOC__

== Getting Amazon Web Services Access ==

The Genomics Institute has a series of AWS accounts, each supporting different projects. If you become associated with one or more of those projects, you will need access to the corresponding account or accounts.

We manage AWS IAM access through a single 'top level' account that everyone logs into; once you log in there, you "Switch Role" into the sub-account where you actually run things.

To get access, have your PI or Project Manager email cluster-admin (cluster-admin@soe.ucsc.edu) asking for an AWS account for you, naming in that email the projects you will have access to. The cluster-admin group will contact you with your login credentials.

Once you log in, you can change your password if you want to, and you will be able to set up MFA (Multi-Factor Authentication) for your account. You will be required to use MFA in order to "Switch Role" into any of the sub-accounts for the projects you are working on.

The top level account is known as "gi-gateway"; the login URL for it is:

https://gi-gateway.signin.aws.amazon.com/console

When you log in, you will see a couple of error messages on the AWS dashboard saying you don't have access to view certain resources - '''this is normal''', so just ignore them.

== Configuring Account Credentials ==

Once you log in to gi-gateway, you will have very few permissions to do anything there - which is normal, since you will not be working in that account anyway. The gi-gateway account is just there to authenticate you to AWS.
'''Changing Your Password'''

You can change your password by clicking your username at the top right of the browser window, just to the right of the little bell. If your username is bill@ucsc.edu, for example:

* Click '''"bill@ucsc.edu @ gi-gateway"''' at the top right of your browser window.
* Click the '''"My Security Credentials"''' drop-down menu option.
* Click the '''"Change Password"''' button to change your password.

You will also need to configure MFA on your account before you will be allowed to switch roles into another account.

'''Configuring MFA'''

The most common way to configure MFA is with '''Google Authenticator''', a free app for Apple and Android phones and tablets; simply download it from the app store to get started. Other MFA apps may also work, but we have not tested everything out there.

Once you have Google Authenticator installed, log into the gi-gateway account using the URL above, then:

* Click '''"bill@ucsc.edu @ gi-gateway"''' at the top right of your browser window (again, bill@ucsc.edu is an example).
* Click the '''"My Security Credentials"''' drop-down menu option.
* Scroll down to the MFA (Multi-Factor Authentication) section of the page and click '''"Assign MFA Device"'''.
* In the following menu, select '''"Virtual MFA Device"'''.
* In the following window, click the '''"Show QR Code"''' link; the MFA QR barcode will appear on your screen.
* Open the Google Authenticator app on your mobile device and tap the little "+" symbol in the top right corner of the app to add an account.
* Select "Scan Barcode" in the Google Authenticator app and aim your mobile device's camera at the QR barcode.
* The new MFA device should then be set up, and you should see a 6-digit number with a small timer to the right of it.
Type the 6-digit code it displays into your web browser when asked, wait for the next code to appear after the timer expires, and type that one into the second field. It should then inform you that you have successfully associated an MFA device with your account.

Once you have associated an MFA device with the 'gi-gateway' account, '''log out''', then log back in. It will ask for your username and password, and then for your MFA code, which you can read off whatever code Google Authenticator is displaying at that moment. The code changes every 30 seconds or so.

'''You must log out first and log back in using MFA in order to be able to switch roles!'''

== Switching Roles into Another AWS Account ==

Now that you have configured a password and enabled MFA, you will be allowed to "Switch Roles" into another account so that you can begin work there. The first time you switch roles into an account it will ask you a few questions; subsequently it will remember which roles you have access to, and they will become menu items you can click to switch roles quickly.

Let's assume that you want to switch to the 'pangenomics' AWS account, and that the cluster-admin group has already granted you access to do so. After logging into the 'gi-gateway' account at the URL listed here (same as above):

https://gi-gateway.signin.aws.amazon.com/console

Do the following to switch roles into the 'pangenomics' account (as an example):

* Click '''"bill@ucsc.edu @ gi-gateway"''' at the top right of your browser window (again, bill@ucsc.edu is an example).
* Click the '''"Switch Role"''' option in the drop-down menu.
* In the following menu it will ask you about the role you will be assuming. In our example we will use the following:
** Account* = pangenomics
** Role* = developer
** Display Name = [leave blank, or use a short phrase]
** Color = [choose a color for this role]
* Then click the '''"Switch Role"''' button.
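As a convenience, the AWS console also accepts a "switch role" deep link that arrives with the form above pre-filled; bookmarking one per account saves the clicking. A minimal sketch, using the example account alias and role from this page (verify the link behaves as expected in your browser before relying on it):

```shell
#!/bin/sh
# Sketch: build a console "Switch Role" deep link so the form comes pre-filled.
# 'pangenomics' and 'developer' are the example account alias and role from above.
ACCOUNT="pangenomics"
ROLE="developer"
URL="https://signin.aws.amazon.com/switchrole?account=${ACCOUNT}&roleName=${ROLE}&displayName=${ACCOUNT}-${ROLE}"
echo "$URL"
```

You must still be logged in to gi-gateway (with MFA) for the link to work; it only pre-fills the same form described above.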
If all went well you should land in the 'pangenomics' account, identified in the top right corner of the page as '''"developer @ pangenomics"''', showing your role and the account you are active in. You can then work as normal in that account. If you have not yet been given access to that role, you will receive an error message and will not be allowed to switch roles.

'''NOTE:''' When you switch roles, you may land in a region you don't expect. Always verify the region you are in by looking at the top right of the web page; it will display your region there. Most of our resources live in "Oregon" (us-west-2), but some items are in other regions on a case-by-case basis.

If you wish to switch back to the 'gi-gateway' account to manage something, or to switch to another role in another account, simply:

* Click '''"developer @ pangenomics"''' in the top right corner of the window.
* Select '''"Back to bill@ucsc.edu"'''.

You will then be back in the 'gi-gateway' context, where you can add another role to switch into, manage your credentials, or switch roles again.

== API Access and Secret Keys ==

If you require programmatic access to AWS, you are likely familiar with AWS Access Keys and Secret Keys, which scripts can use to authenticate to the AWS APIs without going through the web console. In the past, access keys and secret keys could be used with no further authentication. This is a security risk, and such keys must be carefully guarded: anyone who obtains your keys can rack up charges on your AWS account without your knowledge!

Under the "Assume Role" mechanism we now use, Access Keys and Secret Keys can still be created by users, but '''only while logged into the gi-gateway account'''. Do not try to create keys while you have "Switched Roles" into another account.
Keys you create in the top-level 'gi-gateway' account will work in any sub-account you can switch roles into. You will, however, need a little more configuration for your keys to work from a UNIX command line.

When using the '''"aws"''' command-line tool (assuming you have it installed; installation is outside the scope of this document), configure it using the steps outlined in this document:

https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html

That document has a lot of other really useful information in it; if you plan on using keys for API access, we highly recommend reading it through.

At a minimum, you will need to configure the "aws" utility and then tweak the config a bit for our setup. To start, run "aws configure". It should look something like this:

 $ aws configure
 AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
 AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
 Default region name [None]: us-west-2
 Default output format [None]:

Most folks do that to start. It creates two files:

 ~/.aws/config
 ~/.aws/credentials

Both files are needed to access AWS via the 'aws' command.

'''~/.aws/credentials'''

This file contains your access key and secret key, and should not need to be modified after running 'aws configure'. The same keys can be used to access any role in any account you have access to.

'''~/.aws/config'''

This file contains account information you will need to tweak. You may want to configure something like this:

 [default]
 region = us-west-2
 
 [profile pangenomics-developer]
 source_profile = default
 role_arn = arn:aws:iam::422448306679:role/developer
 mfa_serial = arn:aws:iam::652235167018:mfa/bill@ucsc.edu

The 'role_arn' line contains the role and account number you are accessing. You can see a list of live account numbers here: [[AWS Account List and Numbers]].
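If you want to sanity-check a profile entry like the one above, the account number and role name can be pulled back out of the role_arn. A small Python sketch using only the standard library (the profile name and ARNs are this document's examples):

```python
import configparser

# Example ~/.aws/config contents, matching the sample profile above.
SAMPLE = """
[default]
region = us-west-2

[profile pangenomics-developer]
source_profile = default
role_arn = arn:aws:iam::422448306679:role/developer
mfa_serial = arn:aws:iam::652235167018:mfa/bill@ucsc.edu
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)

profile = config["profile pangenomics-developer"]
# An IAM role ARN has the shape arn:aws:iam::<account>:role/<role_name>,
# so the account number is the fifth colon-separated field.
account = profile["role_arn"].split(":")[4]
role_name = profile["role_arn"].split("/")[-1]
print(account, role_name)  # 422448306679 developer
```

Pointing `config.read()` at the real `~/.aws/config` instead of the string works the same way, which is handy for scripting across several profiles.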
Find the account number you need and enter it on the role_arn line, along with the role name; you will get the role name from the cluster-admin group when you are granted access.

The 'mfa_serial' line contains the identifier for your MFA device. It will always look like '''"arn:aws:iam::652235167018:mfa/[your_iam_username]"'''. The account number there is always '652235167018' because that is the account number of the top-level 'gi-gateway' account.

Once that is configured, you should be able to reference the profile you just created when using the aws command, like so:

 $ aws s3 ls --profile pangenomics-developer

It will ask you for your MFA code and then run the command. Once you enter the MFA code, the token it creates is valid for one hour by default, so you can run other 'aws' CLI commands for an hour without re-authenticating with MFA. After one hour, you will need to authenticate via MFA again. You can extend your session length from one hour to twelve hours by utilizing the AWS Security Token Service (AWS STS). See this page for more information on how to do this:

https://docs.aws.amazon.com/cli/latest/reference/sts/assume-role.html#examples

The examples at the bottom are particularly useful.
To get access, you will need your PI or Project Manager to email cluster-admin (cluster-admin@soe.ucsc.edu) asking for an AWS account for you, and also in that email to name the projects you will have access to. The cluster-admin group will contact you with your credentials to login. Once you login, you can change your password if you want to and also you will be able to set up MFA (Multi Factor Authentication) for your account. You will be required to use MFA in order to "Switch Role" into any of the sub-accounts for the projects you are working on. The login URL to use when logging in to the top level account is listed below. The top level account is known as "gi-gateway": https://gi-gateway.signin.aws.amazon.com/console When you login, you will see a couple error messages on the AWS dashboard saying you don't have access to view certain resources - '''this is normal''', so just ignore the error messages. == Configuring Account Credentials == Once you login to the gi-gateway, you will have very few permissions to do anything there - which is normal, since you will not be working in that account anyway. The gi-gateway account is just there to authenticate you to AWS. '''Changing Your Password''' You can change your password by clicking on your username on the top right of the web browser window, just to the right of the little bell. If your username is bill@ucsc.edu, for example: * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window. * Click the '''"My Security Credentials"''' drop-down menu option. * Click the '''"Change Password"''' button to change your password. You will also need to configure MFA on your account before you will be allowed to switch roles into another account. '''Configuring MFA''' To configure MFA (Multi Factor Authentication), the most common way to do it is to use '''Google Authenticator''', which is an app available for Apple and Android based cell phones and mobile devices. 
The app is free, simply download it from the app store to your cell phone or tablet to get started. Other MFA apps may also work but we have not tested everything out there. Once you have Google Authenticator installed, log into the gi-gateway account using the above URL, then: * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window (again, bill@ucsc.edu is an example). * Click the '''"My Security Credentials"''' drop-down menu option. * Scroll down to the MFA (Multi-Factor Authentication) section of the page, and click '''"Assign MFA Device"'''. * In the following menu select '''"Virtual MFA Device"'''. * In the following window click the '''"Show QR Code"''' link, and the MFA QR barcode will appear on your screen. * Open the Google Authenticator app on your mobile device, and click the little "+" symbol in the top right corner of the app to add an account. * You will then need to select "Scan Barcode" in the Google Authenticaor app to continue, and aim your mobile device camera at the QR barcode. * The new account MFA device should then be set up and you should see a 6 digit number with a small timer to the right of it. You must type in one 6 digit code that it displays into your web browser when asked, then wait for the next code to appear after the timer expires, and type that into the second field. It should then inform you that you have successfully associated an MFA device with your account. Once you have associated an MFA device with the 'gi-gateway' Account, '''log out''', then log back in. It will ask for your username and password, and then ask for your MFA code, which you can view by opening Google Authenticator and seeing what code it is displaying at that time. The code changes every 30 seconds or so. 
'''You must log out first and log back in using MFA in order to be able to switch roles!!!''' == Switching Roles into Another AWS Account == Now that you have configured a password and enabled MFA, you will be allowed to "Switch Roles" into another account such that you can begin work there. The first time you switch roles into an account it will ask you a few questions, but subsequently it will remember which roles you have access to and they will become a menu item you can click on to quickly switch roles. Let's assume that you want to switch to the 'pangenomics' AWS account, and you have been already granted access to do so by the cluster-admin group. After logging into the 'gi-gateway' account at the URL listed here (same as above): https://gi-gateway.signin.aws.amazon.com/console Do the following to switch roles into the 'pangenomics' account (as an example): * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window (again, bill@ucsc.edu is an example). * Click the '''"Switch Role"''' option in the drop-down menu. * In the following menu it will ask you about the role you will be assuming. In our example we will use the following: -Account* = pangenomics -Role* = developer -Display Name = [leave blank, or use a short phrase] -Color = [choose a color for this role] * Then click the "Switch Role" button. If all went well you should be dumped into the 'pangenomics' account, and you should be identified in the top right hand corner of the page as '''"developer @ pangenomics"''', indicating your role and the account you are active in. You can then work as normal in that account. If you have not yet been given access to that role, you will receive an error message and not be allowed to switch roles. '''NOTE:''' When you switch roles, it may dump you into a region that you don't expect it to. Always verify the region you are in by looking at the top right of the web page - it will display your region there. 
Most of our stuff exists in "Oregon" (us-west-2), but some items appear in other regions on a per-case basis. If you wish to switch context back to the 'gi-gateway' account in order to manage something, or to switch to another role in another account, simple: * Click '''"develop @ pangenomics"''' in the top right corner of the window. * Select '''"Back to bill@ucsc.edu"''' You will then be sent back to the 'gi-gateway' context, and you can add another role to switch into, manage your credentials and further switch roles. == API Access and Secret Keys == If you require programmatic access to AWS, you will very likely be familiar with the AWS concept of Access Keys and Secret Keys, which can be used by scripts to authenticate yourself to AWS and use the APIs there without using the web console to authenticate. In the past, access keys and secret keys could be used by users with no further authentication. This introduces a security risk, as the management of those keys must be carefully guarded - if anyone gets your keys, they can rack up charges on your AWS account without your knowledge! Using the "Assume Role" mechanism we are now using, Access Keys and Secret Keys can still be created by users '''while logged into the gi-gateway account only'''. Do not try to create keys while you have "Switched Roles" into another account. Keys you create in the top level 'gi-gateway' account will work for you in any sub-account you have access to switch roles to. You will need to do a little more configuration for your keys to work from a UNIX command line however. 
When using the '''"aws"''' command line tool, assuming you have it installed (the process of which is outside the scope of this document), you would use the steps outlined in this document to configure it: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html That document has a lot of other really useful information in it - if you plan on using keys for API access, we highly recommend reading it through. Generically, if you plan on using keys for API Access, minimally you will need to configure the "aws" utility and then tweak the config a bit for our setup. To start, run "aws configure". It should look something like this: $ aws configure AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY Default region name [None]: us-west-2 Default output format [None]: Most folks do that to start. It creates two files: ~/.aws/config ~/.aws/credentials Those two files are important to access AWS via the 'aws' command. '''~/.aws/credentials''' This file contains your access key and secret key, and should not need to be modified after running 'aws configure'. Your same keys can be used to access any roles in any accounts you have access to. '''~/.aws/config''' This file contains some account information you will need to tweak. You may want to configure something like this: [default] region = us-west-2 [profile pangenomics-developer] source_profile = default role_arn = arn:aws:iam::422448306679:role/developer mfa_serial = arn:aws:iam::652235167018:mfa/bill@ucsc.edu The 'role-arn' line contains the role and account number you are accessing. You can see a list of live account numbers here: [[AWS Account List and Numbers]]. Find the account number you need and enter it on the role_arn line, as well as the role name. You will get the role name from the cluster-admin group when you get access. The 'mda_serial' line contains the identifier for your MFA device. 
It will always look like '''"arn:aws:iam::652235167018:mfa/[your_iam_username]"'''. The account number listed there will always be '652235167018' because that is the account number of the top level 'gi-gateway' account. Once that is configured, you should be able to reference the profile you just created when using the aws command, like so: $ aws s3 ls --profile pangenomics-developer It will ask you for your MFA code and then run the command. Once you enter the MFA code, the token it creates will be valid by default for one hour, so you can run other 'aws' cli commands for one hour without the need to re-authenticate with MFA. After one hour, you will need to authenticate via MFA again. You can extend your session length from one hour to twelve hours but utilizing the AWS Security Token Service (AWS STS). See this page for more information on how to do this: https://docs.aws.amazon.com/cli/latest/reference/sts/assume-role.html#examples The examples at the bottom are particularly useful. e95b2fc881afeb558e34986578cb22f082dd6760 122 121 2019-02-07T18:47:13Z Weiler 3 /* Switching Roles into Another AWS Account */ wikitext text/x-wiki __TOC__ == Getting Amazon Web Services Access == The Genomics Institute has a series of AWS Accounts that all support different projects. Often if you become associated with one or more of those projects, you will need access to that account or accounts. The way we are managing AWS IAM Account Access is that we have one AWS account that is the 'top level' account that everyone gets access to, and then, once you log in there, you can "Switch Role" into another sub-account that you are running things in. To get access, you will need your PI or Project Manager to email cluster-admin (cluster-admin@soe.ucsc.edu) asking for an AWS account for you, and also in that email to name the projects you will have access to. The cluster-admin group will contact you with your credentials to login. 
Once you login, you can change your password if you want to and also you will be able to set up MFA (Multi Factor Authentication) for your account. You will be required to use MFA in order to "Switch Role" into any of the sub-accounts for the projects you are working on. The login URL to use when logging in to the top level account is listed below. The top level account is known as "gi-gateway": https://gi-gateway.signin.aws.amazon.com/console When you login, you will see a couple error messages on the AWS dashboard saying you don't have access to view certain resources - '''this is normal''', so just ignore the error messages. == Configuring Account Credentials == Once you login to the gi-gateway, you will have very few permissions to do anything there - which is normal, since you will not be working in that account anyway. The gi-gateway account is just there to authenticate you to AWS. '''Changing Your Password''' You can change your password by clicking on your username on the top right of the web browser window, just to the right of the little bell. If your username is bill@ucsc.edu, for example: * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window. * Click the '''"My Security Credentials"''' drop-down menu option. * Click the '''"Change Password"''' button to change your password. You will also need to configure MFA on your account before you will be allowed to switch roles into another account. '''Configuring MFA''' To configure MFA (Multi Factor Authentication), the most common way to do it is to use '''Google Authenticator''', which is an app available for Apple and Android based cell phones and mobile devices. The app is free, simply download it from the app store to your cell phone or tablet to get started. Other MFA apps may also work but we have not tested everything out there. 
Once you have Google Authenticator installed, log into the gi-gateway account using the above URL, then: * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window (again, bill@ucsc.edu is an example). * Click the '''"My Security Credentials"''' drop-down menu option. * Scroll down to the MFA (Multi-Factor Authentication) section of the page, and click '''"Assign MFA Device"'''. * In the following menu select '''"Virtual MFA Device"'''. * In the following window click the '''"Show QR Code"''' link, and the MFA QR barcode will appear on your screen. * Open the Google Authenticator app on your mobile device, and click the little "+" symbol in the top right corner of the app to add an account. * You will then need to select "Scan Barcode" in the Google Authenticaor app to continue, and aim your mobile device camera at the QR barcode. * The new account MFA device should then be set up and you should see a 6 digit number with a small timer to the right of it. You must type in one 6 digit code that it displays into your web browser when asked, then wait for the next code to appear after the timer expires, and type that into the second field. It should then inform you that you have successfully associated an MFA device with your account. Once you have associated an MFA device with the 'gi-gateway' Account, '''log out''', then log back in. It will ask for your username and password, and then ask for your MFA code, which you can view by opening Google Authenticator and seeing what code it is displaying at that time. The code changes every 30 seconds or so. '''You must log out first and log back in using MFA in order to be able to switch roles!!!''' == Switching Roles into Another AWS Account == Now that you have configured a password and enabled MFA, you will be allowed to "Switch Roles" into another account such that you can begin work there. 
The first time you switch roles into an account it will ask you a few questions, but subsequently it will remember which roles you have access to and they will become a menu item you can click on to quickly switch roles. Let's assume that you want to switch to the 'pangenomics' AWS account, and you have been already granted access to do so by the cluster-admin group. After logging into the 'gi-gateway' account at the URL listed here (same as above): https://gi-gateway.signin.aws.amazon.com/console Do the following to switch roles into the 'pangenomics' account (as an example): * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window (again, bill@ucsc.edu is an example). * Click the '''"Switch Role"''' option in the drop-down menu. * In the following menu it will ask you about the role you will be assuming. In our example we will use the following: -Account* = pangenomics -Role* = developer -Display Name = [leave blank, or use a short phrase] -Color = [choose a color for this role] * Then click the "Switch Role" button. If all went well you should be dumped into the 'pangenomics' account, and you should be identified in the top right hand corner of the page as '''"developer @ pangenomics"''', indicating your role and the account you are active in. You can then work as normal in that account. If you have not yet been given access to that role, you will receive an error message and not be allowed to switch roles. '''NOTE:''' When you switch roles, it may dump you into a region that you don't expect it to. Always verify the region you are in by looking at the top right of the web page - it will display your region there. Most of our stuff exists in "Oregon" (us-west-2), but some items appear in other regions on a per-case basis. 
If you wish to switch context back to the 'gi-gateway' account in order to manage something, or to switch to another role in another account, simple: * Click '''"developer @ pangenomics"''' in the top right corner of the window. * Select '''"Back to bill@ucsc.edu"''' You will then be sent back to the 'gi-gateway' context, and you can add another role to switch into, manage your credentials and further switch roles. == API Access and Secret Keys == If you require programmatic access to AWS, you will very likely be familiar with the AWS concept of Access Keys and Secret Keys, which can be used by scripts to authenticate yourself to AWS and use the APIs there without using the web console to authenticate. In the past, access keys and secret keys could be used by users with no further authentication. This introduces a security risk, as the management of those keys must be carefully guarded - if anyone gets your keys, they can rack up charges on your AWS account without your knowledge! Using the "Assume Role" mechanism we are now using, Access Keys and Secret Keys can still be created by users '''while logged into the gi-gateway account only'''. Do not try to create keys while you have "Switched Roles" into another account. Keys you create in the top level 'gi-gateway' account will work for you in any sub-account you have access to switch roles to. You will need to do a little more configuration for your keys to work from a UNIX command line however. When using the '''"aws"''' command line tool, assuming you have it installed (the process of which is outside the scope of this document), you would use the steps outlined in this document to configure it: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html That document has a lot of other really useful information in it - if you plan on using keys for API access, we highly recommend reading it through. 
Generically, if you plan on using keys for API Access, minimally you will need to configure the "aws" utility and then tweak the config a bit for our setup. To start, run "aws configure". It should look something like this: $ aws configure AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY Default region name [None]: us-west-2 Default output format [None]: Most folks do that to start. It creates two files: ~/.aws/config ~/.aws/credentials Those two files are important to access AWS via the 'aws' command. '''~/.aws/credentials''' This file contains your access key and secret key, and should not need to be modified after running 'aws configure'. Your same keys can be used to access any roles in any accounts you have access to. '''~/.aws/config''' This file contains some account information you will need to tweak. You may want to configure something like this: [default] region = us-west-2 [profile pangenomics-developer] source_profile = default role_arn = arn:aws:iam::422448306679:role/developer mfa_serial = arn:aws:iam::652235167018:mfa/bill@ucsc.edu The 'role-arn' line contains the role and account number you are accessing. You can see a list of live account numbers here: [[AWS Account List and Numbers]]. Find the account number you need and enter it on the role_arn line, as well as the role name. You will get the role name from the cluster-admin group when you get access. The 'mda_serial' line contains the identifier for your MFA device. It will always look like '''"arn:aws:iam::652235167018:mfa/[your_iam_username]"'''. The account number listed there will always be '652235167018' because that is the account number of the top level 'gi-gateway' account. Once that is configured, you should be able to reference the profile you just created when using the aws command, like so: $ aws s3 ls --profile pangenomics-developer It will ask you for your MFA code and then run the command. 
Once you enter the MFA code, the token it creates will be valid by default for one hour, so you can run other 'aws' cli commands for one hour without the need to re-authenticate with MFA. After one hour, you will need to authenticate via MFA again. You can extend your session length from one hour to twelve hours but utilizing the AWS Security Token Service (AWS STS). See this page for more information on how to do this: https://docs.aws.amazon.com/cli/latest/reference/sts/assume-role.html#examples The examples at the bottom are particularly useful. 41cad9b0af34b758e4147272e88db05616f2012e 123 122 2019-02-07T18:49:46Z Weiler 3 /* Getting Amazon Web Services Access */ wikitext text/x-wiki __TOC__ == Getting Amazon Web Services Access == The Genomics Institute has a series of AWS Accounts that all support different projects. Often if you become associated with one or more of those projects, you will need access to that account or accounts. The way we are managing AWS IAM Account Access is that we have one AWS account that is the 'top level' account that everyone gets access to, and then, once you log in there, you can "Switch Role" into another sub-account that you are running things in. To get access, you will need your PI or Project Manager to email cluster-admin (cluster-admin@soe.ucsc.edu) asking for an AWS account for you, and also in that email to name the projects you will have access to. The cluster-admin group will contact you with your credentials to login. Once you login, you can change your password if you want to and also you will be able to set up MFA (Multi Factor Authentication) for your account. You will be required to use MFA in order to "Switch Role" into any of the sub-accounts for the projects you are working on. The login URL to use when logging in to the top level account is listed below. 
The top level account is known as "gi-gateway": https://gi-gateway.signin.aws.amazon.com/console When you login, you '''may''' see a couple error messages on the AWS dashboard saying you don't have access to view certain resources - '''this is normal''', so just ignore the error messages. == Configuring Account Credentials == Once you login to the gi-gateway, you will have very few permissions to do anything there - which is normal, since you will not be working in that account anyway. The gi-gateway account is just there to authenticate you to AWS. '''Changing Your Password''' You can change your password by clicking on your username on the top right of the web browser window, just to the right of the little bell. If your username is bill@ucsc.edu, for example: * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window. * Click the '''"My Security Credentials"''' drop-down menu option. * Click the '''"Change Password"''' button to change your password. You will also need to configure MFA on your account before you will be allowed to switch roles into another account. '''Configuring MFA''' To configure MFA (Multi Factor Authentication), the most common way to do it is to use '''Google Authenticator''', which is an app available for Apple and Android based cell phones and mobile devices. The app is free, simply download it from the app store to your cell phone or tablet to get started. Other MFA apps may also work but we have not tested everything out there. Once you have Google Authenticator installed, log into the gi-gateway account using the above URL, then: * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window (again, bill@ucsc.edu is an example). * Click the '''"My Security Credentials"''' drop-down menu option. * Scroll down to the MFA (Multi-Factor Authentication) section of the page, and click '''"Assign MFA Device"'''. * In the following menu select '''"Virtual MFA Device"'''. 
* In the following window click the '''"Show QR Code"''' link, and the MFA QR barcode will appear on your screen. * Open the Google Authenticator app on your mobile device, and click the little "+" symbol in the top right corner of the app to add an account. * You will then need to select "Scan Barcode" in the Google Authenticaor app to continue, and aim your mobile device camera at the QR barcode. * The new account MFA device should then be set up and you should see a 6 digit number with a small timer to the right of it. You must type in one 6 digit code that it displays into your web browser when asked, then wait for the next code to appear after the timer expires, and type that into the second field. It should then inform you that you have successfully associated an MFA device with your account. Once you have associated an MFA device with the 'gi-gateway' Account, '''log out''', then log back in. It will ask for your username and password, and then ask for your MFA code, which you can view by opening Google Authenticator and seeing what code it is displaying at that time. The code changes every 30 seconds or so. '''You must log out first and log back in using MFA in order to be able to switch roles!!!''' == Switching Roles into Another AWS Account == Now that you have configured a password and enabled MFA, you will be allowed to "Switch Roles" into another account such that you can begin work there. The first time you switch roles into an account it will ask you a few questions, but subsequently it will remember which roles you have access to and they will become a menu item you can click on to quickly switch roles. Let's assume that you want to switch to the 'pangenomics' AWS account, and you have been already granted access to do so by the cluster-admin group. 
After logging into the 'gi-gateway' account at the URL listed here (same as above): https://gi-gateway.signin.aws.amazon.com/console Do the following to switch roles into the 'pangenomics' account (as an example): * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window (again, bill@ucsc.edu is an example). * Click the '''"Switch Role"''' option in the drop-down menu. * In the following menu it will ask you about the role you will be assuming. In our example we will use the following: -Account* = pangenomics -Role* = developer -Display Name = [leave blank, or use a short phrase] -Color = [choose a color for this role] * Then click the "Switch Role" button. If all went well you should be dumped into the 'pangenomics' account, and you should be identified in the top right hand corner of the page as '''"developer @ pangenomics"''', indicating your role and the account you are active in. You can then work as normal in that account. If you have not yet been given access to that role, you will receive an error message and not be allowed to switch roles. '''NOTE:''' When you switch roles, it may dump you into a region that you don't expect it to. Always verify the region you are in by looking at the top right of the web page - it will display your region there. Most of our stuff exists in "Oregon" (us-west-2), but some items appear in other regions on a per-case basis. If you wish to switch context back to the 'gi-gateway' account in order to manage something, or to switch to another role in another account, simple: * Click '''"developer @ pangenomics"''' in the top right corner of the window. * Select '''"Back to bill@ucsc.edu"''' You will then be sent back to the 'gi-gateway' context, and you can add another role to switch into, manage your credentials and further switch roles. 
== API Access and Secret Keys ==

If you require programmatic access to AWS, you are likely familiar with the AWS concepts of Access Keys and Secret Keys, which scripts can use to authenticate to AWS and call its APIs without going through the web console. In the past, access keys and secret keys could be used with no further authentication. This introduces a security risk, so those keys must be carefully guarded - if anyone gets your keys, they can rack up charges on your AWS account without your knowledge!

Under the "Assume Role" mechanism we now use, Access Keys and Secret Keys can still be created by users, '''but only while logged into the gi-gateway account'''. Do not try to create keys while you have "Switched Roles" into another account. Keys you create in the top-level 'gi-gateway' account will work for you in any sub-account you have access to switch roles into. You will, however, need to do a little more configuration for your keys to work from a UNIX command line.

When using the '''"aws"''' command-line tool, assuming you have it installed (the installation process is outside the scope of this document), use the steps outlined in this document to configure it:

https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html

That document contains a lot of other useful information - if you plan on using keys for API access, we highly recommend reading it through.

Generally, if you plan on using keys for API access, you will minimally need to configure the "aws" utility and then tweak the config a bit for our setup. To start, run "aws configure". It should look something like this:

 $ aws configure
 AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
 AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
 Default region name [None]: us-west-2
 Default output format [None]:

Most folks do that to start.
It creates two files:

 ~/.aws/config
 ~/.aws/credentials

Those two files are what the 'aws' command uses to access AWS.

'''~/.aws/credentials'''

This file contains your access key and secret key, and should not need to be modified after running 'aws configure'. The same keys can be used to access any roles in any accounts you have access to.

'''~/.aws/config'''

This file contains some account information you will need to tweak. You may want to configure something like this:

 [default]
 region = us-west-2
 
 [profile pangenomics-developer]
 source_profile = default
 role_arn = arn:aws:iam::422448306679:role/developer
 mfa_serial = arn:aws:iam::652235167018:mfa/bill@ucsc.edu

The 'role_arn' line contains the role and account number you are accessing. You can see a list of live account numbers here: [[AWS Account List and Numbers]]. Find the account number you need and enter it on the role_arn line, along with the role name. You will get the role name from the cluster-admin group when you are granted access.

The 'mfa_serial' line contains the identifier for your MFA device. It will always look like '''"arn:aws:iam::652235167018:mfa/[your_iam_username]"'''. The account number listed there will always be '652235167018' because that is the account number of the top-level 'gi-gateway' account.

Once that is configured, you should be able to reference the profile you just created when using the aws command, like so:

 $ aws s3 ls --profile pangenomics-developer

It will ask you for your MFA code and then run the command. Once you enter the MFA code, the token it creates is valid by default for one hour, so you can run other 'aws' CLI commands for one hour without needing to re-authenticate with MFA. After one hour, you will need to authenticate via MFA again. You can extend your session length from one hour to twelve hours by utilizing the AWS Security Token Service (AWS STS).
See this page for more information on how to do this:

https://docs.aws.amazon.com/cli/latest/reference/sts/assume-role.html#examples

The examples at the bottom are particularly useful.
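As a rough sketch of what such a call looks like, the command below requests temporary credentials for the example 'developer' role for twelve hours. The ARNs are the examples from the config above, and the session name and token code are placeholders; note that sessions longer than one hour also require the role's maximum session duration to have been raised from the one-hour default by an administrator.

```shell
# Request temporary credentials for the example 'developer' role, valid for
# 12 hours (43200 seconds). ARNs, session name, and MFA token code are
# illustrative placeholders - substitute your own values.
aws sts assume-role \
  --role-arn arn:aws:iam::422448306679:role/developer \
  --role-session-name bill-long-session \
  --serial-number arn:aws:iam::652235167018:mfa/bill@ucsc.edu \
  --token-code 123456 \
  --duration-seconds $((12 * 3600))
# The JSON response contains AccessKeyId, SecretAccessKey, and SessionToken,
# which can be exported as AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and
# AWS_SESSION_TOKEN for subsequent commands.
```

Relatedly, exporting '''AWS_PROFILE=pangenomics-developer''' in your shell saves typing '''--profile''' on every command.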
Once you enter the MFA code, the token it creates will be valid by default for one hour, so you can run other 'aws' cli commands for one hour without the need to re-authenticate with MFA. After one hour, you will need to authenticate via MFA again. You can extend your session length from one hour to twelve hours but utilizing the AWS Security Token Service (AWS STS). See this page for more information on how to do this: https://docs.aws.amazon.com/cli/latest/reference/sts/assume-role.html#examples The examples at the bottom are particularly useful. de18dcbdf677a940637f81cae2dbd52a96532dc2 128 127 2019-02-08T18:51:33Z Weiler 3 wikitext text/x-wiki __TOC__ == Getting AWS (Amazon Web Services) Access == The Genomics Institute has a series of AWS Accounts that all support different projects. Often if you become associated with one or more of those projects, you will need access to that account or accounts. The way we are managing AWS IAM Account Access is that we have one AWS account that is the 'top level' account that everyone gets access to, and then, once you log in there, you can "Switch Role" into another sub-account that you are running things in. To get access, you will need your PI or Project Manager to email cluster-admin (cluster-admin@soe.ucsc.edu) asking for an AWS account for you, and also in that email to name the projects you will have access to. The cluster-admin group will contact you with your credentials to login. Once you login, you can change your password if you want to and also you will be able to set up MFA (Multi Factor Authentication) for your account. You will be required to use MFA in order to "Switch Role" into any of the sub-accounts for the projects you are working on. The login URL to use when logging in to the top level account is listed below. 
The top level account is known as "gi-gateway": https://gi-gateway.signin.aws.amazon.com/console When you login, you '''may''' see a couple error messages on the AWS dashboard saying you don't have access to view certain resources - '''this is normal''', so just ignore the error messages. == Configuring Account Credentials == Once you login to the gi-gateway, you will have very few permissions to do anything there - which is normal, since you will not be working in that account anyway. The gi-gateway account is just there to authenticate you to AWS. '''Changing Your Password''' You can change your password by clicking on your username on the top right of the web browser window, just to the right of the little bell. If your username is bill@ucsc.edu, for example: * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window. * Click the '''"My Security Credentials"''' drop-down menu option. * Click the '''"Change Password"''' button to change your password. Note that we have a password strength policy in place, so your password must conform to the following requirements: * Your password must be at least 10 characters long * Your password must contain at least one lowercase letter * Your password must contain at least one non-alphanumeric character * Your password must contain at least one number You will also need to configure MFA on your account before you will be allowed to switch roles into another account. '''Configuring MFA''' To configure MFA (Multi Factor Authentication), the most common way to do it is to use '''Google Authenticator''', which is an app available for Apple and Android based cell phones and mobile devices. The app is free, simply download it from the app store to your cell phone or tablet to get started. Other MFA apps may also work but we have not tested everything out there. 
Once you have Google Authenticator installed, log into the gi-gateway account using the above URL, then: * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window (again, bill@ucsc.edu is an example). * Click the '''"My Security Credentials"''' drop-down menu option. * Scroll down to the MFA (Multi-Factor Authentication) section of the page, and click '''"Assign MFA Device"'''. * In the following menu select '''"Virtual MFA Device"'''. * In the following window click the '''"Show QR Code"''' link, and the MFA QR barcode will appear on your screen. * Open the Google Authenticator app on your mobile device, and click the little "+" symbol in the top right corner of the app to add an account. * You will then need to select "Scan Barcode" in the Google Authenticaor app to continue, and aim your mobile device camera at the QR barcode. * The new account MFA device should then be set up and you should see a 6 digit number with a small timer to the right of it. You must type in one 6 digit code that it displays into your web browser when asked, then wait for the next code to appear after the timer expires, and type that into the second field. It should then inform you that you have successfully associated an MFA device with your account. Once you have associated an MFA device with the 'gi-gateway' Account, '''log out''', then log back in. It will ask for your username and password, and then ask for your MFA code, which you can view by opening Google Authenticator and seeing what code it is displaying at that time. The code changes every 30 seconds or so. '''You must log out first and log back in using MFA in order to be able to switch roles!!!''' == Switching Roles into Another AWS Account == Now that you have configured a password and enabled MFA, you will be allowed to "Switch Roles" into another account such that you can begin work there. 
The first time you switch roles into an account it will ask you a few questions, but subsequently it will remember which roles you have access to and they will become a menu item you can click on to quickly switch roles. Let's assume that you want to switch to the 'pangenomics' AWS account, and you have been already granted access to do so by the cluster-admin group. After logging into the 'gi-gateway' account at the URL listed here (same as above): https://gi-gateway.signin.aws.amazon.com/console Do the following to switch roles into the 'pangenomics' account (as an example): * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window (again, bill@ucsc.edu is an example). * Click the '''"Switch Role"''' option in the drop-down menu. * In the following menu it will ask you about the role you will be assuming. In our example we will use the following: -Account* = pangenomics -Role* = developer -Display Name = [leave blank, or use a short phrase] -Color = [choose a color for this role] * Then click the "Switch Role" button. If all went well you should be dumped into the 'pangenomics' account, and you should be identified in the top right hand corner of the page as '''"developer @ pangenomics"''', indicating your role and the account you are active in. You can then work as normal in that account. If you have not yet been given access to that role, you will receive an error message and not be allowed to switch roles. '''NOTE:''' When you switch roles, it may dump you into a region that you don't expect it to. Always verify the region you are in by looking at the top right of the web page - it will display your region there. Most of our stuff exists in "Oregon" (us-west-2), but some items appear in other regions on a per-case basis. 
If you wish to switch context back to the 'gi-gateway' account in order to manage something, or to switch to another role in another account, simple: * Click '''"developer @ pangenomics"''' in the top right corner of the window. * Select '''"Back to bill@ucsc.edu"''' You will then be sent back to the 'gi-gateway' context, and you can add another role to switch into, manage your credentials and further switch roles. == API Access and Secret Keys == If you require programmatic access to AWS, you will very likely be familiar with the AWS concept of Access Keys and Secret Keys, which can be used by scripts to authenticate yourself to AWS and use the APIs there without using the web console to authenticate. In the past, access keys and secret keys could be used by users with no further authentication. This introduces a security risk, as the management of those keys must be carefully guarded - if anyone gets your keys, they can rack up charges on your AWS account without your knowledge! Using the "Assume Role" mechanism we are now using, Access Keys and Secret Keys can still be created by users '''while logged into the gi-gateway account only'''. Do not try to create keys while you have "Switched Roles" into another account. Keys you create in the top level 'gi-gateway' account will work for you in any sub-account you have access to switch roles to. You will need to do a little more configuration for your keys to work from a UNIX command line however. When using the '''"aws"''' command line tool, assuming you have it installed (the process of which is outside the scope of this document), you would use the steps outlined in this document to configure it: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html That document has a lot of other really useful information in it - if you plan on using keys for API access, we highly recommend reading it through. 
Generically, if you plan on using keys for API Access, minimally you will need to configure the "aws" utility and then tweak the config a bit for our setup. To start, run "aws configure". It should look something like this: $ aws configure AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY Default region name [None]: us-west-2 Default output format [None]: Most folks do that to start. It creates two files: ~/.aws/config ~/.aws/credentials Those two files are important to access AWS via the 'aws' command. '''~/.aws/credentials''' This file contains your access key and secret key, and should not need to be modified after running 'aws configure'. Your same keys can be used to access any roles in any accounts you have access to. '''~/.aws/config''' This file contains some account information you will need to tweak. You may want to configure something like this: [default] region = us-west-2 [profile pangenomics-developer] source_profile = default role_arn = arn:aws:iam::422448306679:role/developer mfa_serial = arn:aws:iam::652235167018:mfa/bill@ucsc.edu The 'role_arn' line contains the role and account number you are accessing. You can see a list of live account numbers here: [[AWS Account List and Numbers]]. Find the account number you need and enter it on the role_arn line, as well as the role name. You will get the role name from the cluster-admin group when you get access. The 'mfa_serial' line contains the identifier for your MFA device. It will always look like '''"arn:aws:iam::652235167018:mfa/[your_iam_username]"'''. The account number listed there will always be '652235167018' because that is the account number of the top level 'gi-gateway' account. Once that is configured, you should be able to reference the profile you just created when using the aws command, like so: $ aws s3 ls --profile pangenomics-developer It will ask you for your MFA code and then run the command. 
__TOC__

== Getting AWS (Amazon Web Services) Access ==

The Genomics Institute maintains a number of AWS accounts, each supporting different projects. If you become associated with one or more of those projects, you will need access to the corresponding account or accounts.

We manage AWS IAM access through a single 'top level' account that everyone logs into. Once you log in there, you "Switch Role" into the sub-account where you actually run things.

To get access, have your PI or Project Manager email cluster-admin (cluster-admin@soe.ucsc.edu) requesting an AWS account for you, naming in that email the projects you will need access to. The cluster-admin group will then contact you with your login credentials. Once you log in, you can change your password if you want to, and you will be able to set up MFA (Multi-Factor Authentication) for your account. MFA is required in order to "Switch Role" into any of the sub-accounts for the projects you are working on. The login URL for the top level account is listed below.
The top level account is known as "gi-gateway":

 https://gi-gateway.signin.aws.amazon.com/console

When you log in, you '''may''' see a couple of error messages on the AWS dashboard saying you don't have access to view certain resources. '''This is normal''' - just ignore them.

== Configuring Account Credentials ==

Once you log in to gi-gateway, you will have very few permissions to do anything there. This is normal, since you will not be working in that account anyway; the gi-gateway account exists only to authenticate you to AWS.

'''Changing Your Password'''

You can change your password by clicking on your username at the top right of the browser window, just to the right of the little bell. If your username is bill@ucsc.edu, for example:

* Click '''"bill@ucsc.edu @ gi-gateway"''' at the top right of your browser window.
* Click the '''"My Security Credentials"''' drop-down menu option.
* Click the '''"Change Password"''' button to change your password.

Note that we have a password strength policy in place, so your password must conform to the following requirements:

* At least 10 characters long
* At least one lowercase letter
* At least one non-alphanumeric character
* At least one number

You will also need to configure MFA on your account before you will be allowed to switch roles into another account.

'''Configuring MFA'''

The most common way to configure MFA (Multi-Factor Authentication) is with '''Google Authenticator''', a free app for Apple and Android phones and tablets - simply download it from your app store to get started. Other MFA apps may also work, but we have not tested everything out there.
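The 6-digit codes these apps generate follow the standard TOTP algorithm (RFC 6238: HMAC-SHA1 over a 30-second time step), which is why any compliant authenticator app can work. A minimal sketch of the computation such apps perform (illustrative only - the secret below is the RFC test key, not a real MFA seed):

```python
import hashlib
import hmac
import struct
import time

def totp(secret: bytes, for_time=None, step: int = 30, digits: int = 6) -> str:
    """RFC 6238 TOTP code, as computed by Google Authenticator."""
    if for_time is None:
        for_time = time.time()
    # Count 30-second steps since the Unix epoch.
    counter = int(for_time) // step
    msg = struct.pack(">Q", counter)
    digest = hmac.new(secret, msg, hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226): low nibble of last byte picks the offset.
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

# RFC 6238 test vector: at time 59 the SHA-1 code is 94287082 (last 6: 287082).
print(totp(b"12345678901234567890", for_time=59))
```

Because the code depends only on the shared secret (from the QR code) and the current time, your phone and AWS agree on the code without any network connection from the app.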
Once you have Google Authenticator installed, log in to the gi-gateway account using the above URL, then:

* Click '''"bill@ucsc.edu @ gi-gateway"''' at the top right of your browser window (again, bill@ucsc.edu is an example).
* Click the '''"My Security Credentials"''' drop-down menu option.
* Scroll down to the MFA (Multi-Factor Authentication) section of the page, and click '''"Assign MFA Device"'''.
* In the following menu, select '''"Virtual MFA Device"'''.
* In the following window, click the '''"Show QR Code"''' link; the MFA QR code will appear on your screen.
* Open the Google Authenticator app on your mobile device, and tap the "+" symbol in the top right corner of the app to add an account.
* Select "Scan Barcode" in the Google Authenticator app, and aim your device's camera at the QR code.
* The new MFA device should now be set up, and you should see a 6-digit number with a small timer to the right of it. Type the 6-digit code it displays into your web browser when asked, wait for the next code to appear after the timer expires, then type that one into the second field. You should then be told that you have successfully associated an MFA device with your account.

Once you have associated an MFA device with the 'gi-gateway' account, '''log out''', then log back in. You will be asked for your username and password, and then for your MFA code, which you can read off Google Authenticator at that moment. The code changes every 30 seconds.

'''You must log out and log back in using MFA in order to be able to switch roles!!!'''

== Switching Roles into Another AWS Account ==

Now that you have configured a password and enabled MFA, you can "Switch Roles" into another account and begin work there.
The first time you switch roles into an account, you will be asked a few questions; after that, the console remembers which roles you have used and presents them as menu items you can click to switch quickly. Let's assume you want to switch to the 'pangenomics' AWS account, and the cluster-admin group has already granted you access to do so. After logging in to the 'gi-gateway' account at the URL listed above:

 https://gi-gateway.signin.aws.amazon.com/console

do the following to switch roles into the 'pangenomics' account (as an example):

* Click '''"bill@ucsc.edu @ gi-gateway"''' at the top right of your browser window (again, bill@ucsc.edu is an example).
* Click the '''"Switch Role"''' option in the drop-down menu.
* The following menu asks about the role you will be assuming. In our example:
** Account* = pangenomics
** Role* = developer
** Display Name = [leave blank, or use a short phrase]
** Color = [choose a color for this role]
* Then click the "Switch Role" button.

If all went well, you should land in the 'pangenomics' account, identified in the top right corner of the page as '''"developer @ pangenomics"''' - your role and the account you are active in. You can then work as normal in that account. If you have not yet been given access to that role, you will receive an error message and will not be allowed to switch roles.

'''NOTE:''' Switching roles may drop you into a region you don't expect. Always verify the region you are in by looking at the top right of the web page - it is displayed there. Most of our resources are in "Oregon" (us-west-2), but some items live in other regions on a case-by-case basis.
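The switch-role form above can also be bookmarked: the AWS console accepts a "switchrole" deep link that pre-fills the account and role fields. A small sketch that builds one (the helper function is ours, not an AWS tool; the account alias and role name are the example values from this section):

```python
from urllib.parse import urlencode

def switch_role_url(account: str, role: str, display_name: str = "") -> str:
    """Build a console deep link that pre-fills the "Switch Role" form.

    'account' may be a 12-digit account ID or an account alias
    such as 'pangenomics'.
    """
    params = {"account": account, "roleName": role}
    if display_name:
        params["displayName"] = display_name
    return "https://signin.aws.amazon.com/switchrole?" + urlencode(params)

print(switch_role_url("pangenomics", "developer"))
```

Opening the resulting URL while logged in to gi-gateway takes you straight to the confirmation page for that role, which is handy if you switch between several accounts often.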
If you wish to switch context back to the 'gi-gateway' account to manage something, or to switch to a role in another account, simply:

* Click '''"developer @ pangenomics"''' in the top right corner of the window.
* Select '''"Back to bill@ucsc.edu"'''

You will be returned to the 'gi-gateway' context, where you can add another role to switch into, manage your credentials, and switch roles again.

== API Access and Secret Keys ==

If you require programmatic access to AWS, you are likely familiar with AWS Access Keys and Secret Keys, which scripts can use to authenticate to the AWS APIs without going through the web console. In the past, access keys and secret keys could be used with no further authentication. This introduces a security risk, so those keys must be carefully guarded - if anyone gets your keys, they can rack up charges on your AWS account without your knowledge!

Under the "Assume Role" mechanism we now use, Access Keys and Secret Keys can still be created by users '''while logged into the gi-gateway account only'''. Do not try to create keys while you have "Switched Roles" into another account. Keys you create in the top level 'gi-gateway' account will work for you in any sub-account you have access to switch roles to.

You will, however, need a little more configuration for your keys to work from a UNIX command line. For the '''"aws"''' command line tool, assuming you have it installed (installation is outside the scope of this document), use the steps outlined in this document to configure it:

 https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html

That document has a lot of other really useful information in it - if you plan on using keys for API access, we highly recommend reading it through.
If you plan on using keys for API access, you will minimally need to configure the "aws" utility and then tweak the config a bit for our setup. To start, run "aws configure". It should look something like this:

 $ aws configure
 AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
 AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
 Default region name [None]: us-west-2
 Default output format [None]:

This creates two files:

 ~/.aws/config
 ~/.aws/credentials

Both files are needed to access AWS via the 'aws' command.

'''~/.aws/credentials'''

This file contains your access key and secret key, and should not need to be modified after running 'aws configure'. The same keys can be used to access any role in any account you have access to.

'''~/.aws/config'''

This file contains account information you will need to tweak. You may want to configure something like this:

 [default]
 region = us-west-2
 
 [profile pangenomics-developer]
 source_profile = default
 role_arn = arn:aws:iam::422448306679:role/developer
 mfa_serial = arn:aws:iam::652235167018:mfa/bill@ucsc.edu
 duration_seconds = 43200

The 'role_arn' line names the role and the account number you are accessing. You can see a list of live account numbers here: [[AWS Account List and Numbers]]. Find the account number you need and enter it on the role_arn line, along with the role name, which the cluster-admin group will give you when you are granted access.

The 'mfa_serial' line identifies your MFA device. It will always look like '''"arn:aws:iam::652235167018:mfa/[your_iam_username]"'''; the account number there is always '652235167018', because that is the account number of the top level 'gi-gateway' account.

The 'duration_seconds' parameter sets your session token lifetime to 43200 seconds (12 hours).
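Under the hood, the CLI turns such a profile into an sts:AssumeRole API call. As an illustration of how the config fields map onto that request, here is a sketch (the helper function and session-name convention are ours, not part of the CLI; account and role values are the examples above):

```python
def assume_role_params(account_id: str, role_name: str, iam_user: str,
                       mfa_code: str, duration: int = 43200) -> dict:
    """Mirror how the ~/.aws/config profile fields map onto sts:AssumeRole.

    652235167018 is the top level 'gi-gateway' account that holds
    the MFA device.
    """
    return {
        "RoleArn": f"arn:aws:iam::{account_id}:role/{role_name}",   # role_arn
        "RoleSessionName": iam_user.replace("@", "-at-"),
        "SerialNumber": f"arn:aws:iam::652235167018:mfa/{iam_user}",  # mfa_serial
        "TokenCode": mfa_code,                # the 6-digit code you type in
        "DurationSeconds": duration,          # duration_seconds
    }

params = assume_role_params("422448306679", "developer",
                            "bill@ucsc.edu", "123456")
```

With boto3, a call like sts_client.assume_role(**params) would return temporary credentials; the aws CLI performs this call for you and caches the result for the profile.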
That means you will only have to authenticate with MFA once every 12 hours '''while you are using that same shell session''' - the CLI will not ask for MFA on every command you run in the meantime. 12 hours is the maximum you can request, although you can specify less than that.

Once that is configured, you should be able to reference the profile you just created when using the aws command, like so:

 $ aws s3 ls --profile pangenomics-developer

It will ask you for your MFA code and then run the command. Without 'duration_seconds', the token created when you enter the MFA code is valid for one hour by default, after which you must authenticate via MFA again. Extending the session from one hour up to twelve hours works by utilizing the AWS Security Token Service (AWS STS). See this page for more information:

 https://docs.aws.amazon.com/cli/latest/reference/sts/assume-role.html#examples

The examples at the bottom are particularly useful.
When using the '''"aws"''' command line tool, assuming you have it installed (the process of which is outside the scope of this document), you would use the steps outlined in this document to configure it: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html That document has a lot of other really useful information in it - if you plan on using keys for API access, we highly recommend reading it through. Generically, if you plan on using keys for API Access, minimally you will need to configure the "aws" utility and then tweak the config a bit for our setup. To start, run "aws configure". It should look something like this: $ aws configure AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY Default region name [None]: us-west-2 Default output format [None]: Most folks do that to start. It creates two files: ~/.aws/config ~/.aws/credentials Those two files are important to access AWS via the 'aws' command. '''~/.aws/credentials''' This file contains your access key and secret key, and should not need to be modified after running 'aws configure'. Your same keys can be used to access any roles in any accounts you have access to. '''~/.aws/config''' This file contains some account information you will need to tweak. You may want to configure something like this: [default] region = us-west-2 [profile pangenomics-developer] source_profile = default role_arn = arn:aws:iam::422448306679:role/developer mfa_serial = arn:aws:iam::652235167018:mfa/bill@ucsc.edu duration_seconds = 43200 The 'role_arn' line contains the role and account number you are accessing. You can see a list of live account numbers here: [[AWS Account List and Numbers]]. Find the account number you need and enter it on the role_arn line, as well as the role name. You will get the role name from the cluster-admin group when you get access. The 'mfa_serial' line contains the identifier for your MFA device. 
It will always look like '''"arn:aws:iam::652235167018:mfa/[your_iam_username]"'''. The account number listed there will always be '652235167018' because that is the account number of the top level 'gi-gateway' account. The "duration_seconds" parameter says that your session token will be 43200 seconds long (12 hours). That means you will only have to authenticate with MFA once every 12 hours. 12 hours is the maximum you can request, although you can specify less than that. This means it won't ask you for MFA every time you run a command for the next 12 hours. Once that is configured, you should be able to reference the profile you just created when using the aws command, like so: $ aws s3 ls --profile pangenomics-developer It will ask you for your MFA code and then run the command. Once you enter the MFA code, the token it creates will be valid by default for one hour, so you can run other 'aws' cli commands for one hour without the need to re-authenticate with MFA. After one hour, you will need to authenticate via MFA again. You can extend your session length from one hour to twelve hours but utilizing the AWS Security Token Service (AWS STS). See this page for more information on how to do this: https://docs.aws.amazon.com/cli/latest/reference/sts/assume-role.html#examples The examples at the bottom are particularly useful. b2189eff93e229dffb577f0aed2950ccaf379ce8 131 130 2019-02-08T21:27:36Z Weiler 3 /* API Access and Secret Keys */ wikitext text/x-wiki __TOC__ == Getting AWS (Amazon Web Services) Access == The Genomics Institute has a series of AWS Accounts that all support different projects. Often if you become associated with one or more of those projects, you will need access to that account or accounts. The way we are managing AWS IAM Account Access is that we have one AWS account that is the 'top level' account that everyone gets access to, and then, once you log in there, you can "Switch Role" into another sub-account that you are running things in. 
To get access, you will need your PI or Project Manager to email cluster-admin (cluster-admin@soe.ucsc.edu) asking for an AWS account for you, and also in that email to name the projects you will have access to. The cluster-admin group will contact you with your credentials to login. Once you login, you can change your password if you want to and also you will be able to set up MFA (Multi Factor Authentication) for your account. You will be required to use MFA in order to "Switch Role" into any of the sub-accounts for the projects you are working on. The login URL to use when logging in to the top level account is listed below. The top level account is known as "gi-gateway": https://gi-gateway.signin.aws.amazon.com/console When you login, you '''may''' see a couple error messages on the AWS dashboard saying you don't have access to view certain resources - '''this is normal''', so just ignore the error messages. == Configuring Account Credentials == Once you login to the gi-gateway, you will have very few permissions to do anything there - which is normal, since you will not be working in that account anyway. The gi-gateway account is just there to authenticate you to AWS. '''Changing Your Password''' You can change your password by clicking on your username on the top right of the web browser window, just to the right of the little bell. If your username is bill@ucsc.edu, for example: * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window. * Click the '''"My Security Credentials"''' drop-down menu option. * Click the '''"Change Password"''' button to change your password. 
Note that we have a password strength policy in place, so your password must conform to the following requirements: * Your password must be at least 10 characters long * Your password must contain at least one lowercase letter * Your password must contain at least one non-alphanumeric character * Your password must contain at least one number You will also need to configure MFA on your account before you will be allowed to switch roles into another account. '''Configuring MFA''' To configure MFA (Multi Factor Authentication), the most common way to do it is to use '''Google Authenticator''', which is an app available for Apple and Android based cell phones and mobile devices. The app is free, simply download it from the app store to your cell phone or tablet to get started. Other MFA apps may also work but we have not tested everything out there. Once you have Google Authenticator installed, log into the gi-gateway account using the above URL, then: * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window (again, bill@ucsc.edu is an example). * Click the '''"My Security Credentials"''' drop-down menu option. * Scroll down to the MFA (Multi-Factor Authentication) section of the page, and click '''"Assign MFA Device"'''. * In the following menu select '''"Virtual MFA Device"'''. * In the following window click the '''"Show QR Code"''' link, and the MFA QR barcode will appear on your screen. * Open the Google Authenticator app on your mobile device, and click the little "+" symbol in the top right corner of the app to add an account. * You will then need to select "Scan Barcode" in the Google Authenticaor app to continue, and aim your mobile device camera at the QR barcode. * The new account MFA device should then be set up and you should see a 6 digit number with a small timer to the right of it. 
You must type in one 6 digit code that it displays into your web browser when asked, then wait for the next code to appear after the timer expires, and type that into the second field. It should then inform you that you have successfully associated an MFA device with your account. Once you have associated an MFA device with the 'gi-gateway' Account, '''log out''', then log back in. It will ask for your username and password, and then ask for your MFA code, which you can view by opening Google Authenticator and seeing what code it is displaying at that time. The code changes every 30 seconds or so. '''You must log out first and log back in using MFA in order to be able to switch roles!!!''' == Switching Roles into Another AWS Account == Now that you have configured a password and enabled MFA, you will be allowed to "Switch Roles" into another account such that you can begin work there. The first time you switch roles into an account it will ask you a few questions, but subsequently it will remember which roles you have access to and they will become a menu item you can click on to quickly switch roles. Let's assume that you want to switch to the 'pangenomics' AWS account, and you have been already granted access to do so by the cluster-admin group. After logging into the 'gi-gateway' account at the URL listed here (same as above): https://gi-gateway.signin.aws.amazon.com/console Do the following to switch roles into the 'pangenomics' account (as an example): * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window (again, bill@ucsc.edu is an example). * Click the '''"Switch Role"''' option in the drop-down menu. * In the following menu it will ask you about the role you will be assuming. In our example we will use the following: -Account* = pangenomics -Role* = developer -Display Name = [leave blank, or use a short phrase] -Color = [choose a color for this role] * Then click the "Switch Role" button. 
If all went well you should be dumped into the 'pangenomics' account, and you should be identified in the top right hand corner of the page as '''"developer @ pangenomics"''', indicating your role and the account you are active in. You can then work as normal in that account. If you have not yet been given access to that role, you will receive an error message and not be allowed to switch roles. '''NOTE:''' When you switch roles, it may dump you into a region that you don't expect it to. Always verify the region you are in by looking at the top right of the web page - it will display your region there. Most of our stuff exists in "Oregon" (us-west-2), but some items appear in other regions on a per-case basis. If you wish to switch context back to the 'gi-gateway' account in order to manage something, or to switch to another role in another account, simple: * Click '''"developer @ pangenomics"''' in the top right corner of the window. * Select '''"Back to bill@ucsc.edu"''' You will then be sent back to the 'gi-gateway' context, and you can add another role to switch into, manage your credentials and further switch roles. == API Access and Secret Keys == If you require programmatic access to AWS, you will very likely be familiar with the AWS concept of Access Keys and Secret Keys, which can be used by scripts to authenticate yourself to AWS and use the APIs there without using the web console to authenticate. In the past, access keys and secret keys could be used by users with no further authentication. This introduces a security risk, as the management of those keys must be carefully guarded - if anyone gets your keys, they can rack up charges on your AWS account without your knowledge! Using the "Assume Role" mechanism we are now using, Access Keys and Secret Keys can still be created by users '''while logged into the gi-gateway account only'''. Do not try to create keys while you have "Switched Roles" into another account. 
Keys you create in the top level 'gi-gateway' account will work for you in any sub-account you have access to switch roles to. You will need to do a little more configuration for your keys to work from a UNIX command line however. When using the '''"aws"''' command line tool, assuming you have it installed (the process of which is outside the scope of this document), you would use the steps outlined in this document to configure it: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html That document has a lot of other really useful information in it - if you plan on using keys for API access, we highly recommend reading it through. Generically, if you plan on using keys for API Access, minimally you will need to configure the "aws" utility and then tweak the config a bit for our setup. To start, run "aws configure". It should look something like this: $ aws configure AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY Default region name [None]: us-west-2 Default output format [None]: Most folks do that to start. It creates two files: ~/.aws/config ~/.aws/credentials Those two files are important to access AWS via the 'aws' command. '''~/.aws/credentials''' This file contains your access key and secret key, and should not need to be modified after running 'aws configure'. Your same keys can be used to access any roles in any accounts you have access to. '''~/.aws/config''' This file contains some account information you will need to tweak. You may want to configure something like this: [default] region = us-west-2 [profile pangenomics-developer] source_profile = default role_arn = arn:aws:iam::422448306679:role/developer mfa_serial = arn:aws:iam::652235167018:mfa/bill@ucsc.edu duration_seconds = 43200 The 'role_arn' line contains the role and account number you are accessing. 
You can see a list of live account numbers here: [[AWS Account List and Numbers]] Find the account number you need and enter it on the role_arn line, as well as the role name. You will get the role name from the cluster-admin group when you get access. The 'mfa_serial' line contains the identifier for your MFA device. It will always look like '''"arn:aws:iam::652235167018:mfa/[your_iam_username]"'''. The account number listed there will always be '652235167018' because that is the account number of the top level 'gi-gateway' account. The "duration_seconds" parameter says that your session token will be 43200 seconds long (12 hours). That means you will only have to authenticate with MFA once every 12 hours. 12 hours is the maximum you can request, although you can specify less than that. This means it won't ask you for MFA every time you run a command for the next 12 hours. Once that is configured, you should be able to reference the profile you just created when using the aws command, like so: $ aws s3 ls --profile pangenomics-developer It will ask you for your MFA code and then run the command. Once you enter the MFA code, the token it creates will be valid by default for one hour, so you can run other 'aws' cli commands for one hour without the need to re-authenticate with MFA. After one hour, you will need to authenticate via MFA again. You can extend your session length from one hour to twelve hours but utilizing the AWS Security Token Service (AWS STS). See this page for more information on how to do this: https://docs.aws.amazon.com/cli/latest/reference/sts/assume-role.html#examples The examples at the bottom are particularly useful. e10a0de7a94a76feebabe12d64148671d555f3db 132 131 2019-02-08T21:29:52Z Weiler 3 /* API Access and Secret Keys */ wikitext text/x-wiki __TOC__ == Getting AWS (Amazon Web Services) Access == The Genomics Institute has a series of AWS Accounts that all support different projects. 
Often, if you become associated with one or more of those projects, you will need access to the corresponding account or accounts. We manage AWS IAM access through one 'top level' account that everyone gets access to; once you log in there, you can "Switch Role" into the sub-account where you run things.

To get access, you will need your PI or Project Manager to email cluster-admin (cluster-admin@soe.ucsc.edu) asking for an AWS account for you, naming in that email the projects you will have access to. The cluster-admin group will contact you with your login credentials. Once you log in, you can change your password if you want to, and you will be able to set up MFA (Multi-Factor Authentication) for your account. You will be required to use MFA in order to "Switch Role" into any of the sub-accounts for the projects you are working on.

The top level account is known as "gi-gateway", and its login URL is:

https://gi-gateway.signin.aws.amazon.com/console

When you log in, you '''may''' see a couple of error messages on the AWS dashboard saying you don't have access to view certain resources - '''this is normal''', so just ignore them.

== Configuring Account Credentials ==

Once you log in to gi-gateway, you will have very few permissions to do anything there - which is normal, since you will not be working in that account anyway. The gi-gateway account is just there to authenticate you to AWS.

'''Changing Your Password'''

You can change your password by clicking on your username at the top right of the web browser window, just to the right of the little bell. If your username is bill@ucsc.edu, for example:

* Click '''"bill@ucsc.edu @ gi-gateway"''' at the top right of your browser window.
* Click the '''"My Security Credentials"''' drop-down menu option.
* Click the '''"Change Password"''' button to change your password.
Note that we have a password strength policy in place, so your password must conform to the following requirements:

* At least 10 characters long
* At least one lowercase letter
* At least one non-alphanumeric character
* At least one number

You will also need to configure MFA on your account before you will be allowed to switch roles into another account.

'''Configuring MFA'''

The most common way to configure MFA (Multi-Factor Authentication) is with '''Google Authenticator''', a free app available for Apple and Android phones and tablets; simply download it from the app store to get started. Other MFA apps may also work, but we have not tested everything out there.

Once you have Google Authenticator installed, log into the gi-gateway account using the above URL, then:

* Click '''"bill@ucsc.edu @ gi-gateway"''' at the top right of your browser window (again, bill@ucsc.edu is an example).
* Click the '''"My Security Credentials"''' drop-down menu option.
* Scroll down to the MFA (Multi-Factor Authentication) section of the page, and click '''"Assign MFA Device"'''.
* In the following menu, select '''"Virtual MFA Device"'''.
* In the following window, click the '''"Show QR Code"''' link, and the MFA QR code will appear on your screen.
* Open the Google Authenticator app on your mobile device, and tap the little "+" symbol in the top right corner of the app to add an account.
* Select "Scan Barcode" in the Google Authenticator app, and aim your mobile device's camera at the QR code.
* The new MFA device should then be set up, and you should see a 6-digit number with a small timer to the right of it.
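If you are curious where that 6-digit number comes from: virtual MFA devices implement TOTP (RFC 6238), and the QR code simply transfers a shared secret to your phone. A minimal sketch of the algorithm, using the RFC 6238 test secret rather than a real key:

```python
# Sketch of how a virtual MFA device computes its 6-digit codes (TOTP,
# RFC 6238). The secret below is the RFC test value, not a real MFA key.
import hashlib
import hmac
import struct
import time

def totp(secret, at_time=None, step=30, digits=6):
    """Return the time-based one-time password for `secret` at `at_time`."""
    counter = int((time.time() if at_time is None else at_time) // step)
    msg = struct.pack(">Q", counter)                       # 8-byte big-endian counter
    digest = hmac.new(secret, msg, hashlib.sha1).digest()  # HMAC-SHA1 per RFC 4226
    offset = digest[-1] & 0x0F                             # dynamic truncation
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % (10 ** digits)).zfill(digits)

print(totp(b"12345678901234567890", at_time=59))  # → 287082 (RFC 6238 test vector)
```

The 30-second timer in the app corresponds to the `step` parameter: the code changes whenever the counter (current time divided by 30) ticks over.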
When asked, type the 6-digit code currently displayed into your web browser, then wait for the next code to appear after the timer expires, and type that into the second field. It should then inform you that you have successfully associated an MFA device with your account.

Once you have associated an MFA device with the 'gi-gateway' account, '''log out''', then log back in. It will ask for your username and password, and then for your MFA code, which you can view by opening Google Authenticator and seeing what code it is displaying at that time. The code changes every 30 seconds or so.

'''You must log out first and log back in using MFA in order to be able to switch roles!'''

== Switching Roles into Another AWS Account ==

Now that you have configured a password and enabled MFA, you will be allowed to "Switch Roles" into another account so that you can begin work there. The first time you switch roles into an account it will ask you a few questions, but subsequently it will remember which roles you have access to, and they will become menu items you can click on to switch roles quickly.

Let's assume that you want to switch to the 'pangenomics' AWS account, and that you have already been granted access to do so by the cluster-admin group. After logging into the 'gi-gateway' account at the URL listed above:

https://gi-gateway.signin.aws.amazon.com/console

Do the following to switch roles into the 'pangenomics' account (as an example):

* Click '''"bill@ucsc.edu @ gi-gateway"''' at the top right of your browser window (again, bill@ucsc.edu is an example).
* Click the '''"Switch Role"''' option in the drop-down menu.
* In the following menu it will ask you about the role you will be assuming. In our example we will use the following:
** Account* = pangenomics
** Role* = developer
** Display Name = [leave blank, or use a short phrase]
** Color = [choose a color for this role]
* Then click the "Switch Role" button.
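As a convenience, the AWS console also supports direct switch-role links that you can bookmark instead of filling in this form each time. The URL format below is our understanding of AWS's switch-role deep link; verify it against current AWS documentation before relying on it. A small sketch that builds such a link:

```python
# Sketch: build a bookmarkable console "switch role" link. The URL format
# is an assumption based on AWS's documented switch-role deep links;
# double-check it against AWS docs before use.
from urllib.parse import urlencode

def switch_role_url(account, role, display_name=""):
    """Return a console deep link that pre-fills the Switch Role form."""
    params = {"account": account, "roleName": role}
    if display_name:
        params["displayName"] = display_name
    return "https://signin.aws.amazon.com/switchrole?" + urlencode(params)

print(switch_role_url("pangenomics", "developer", "Pangenomics Dev"))
```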
If all went well, you should be dropped into the 'pangenomics' account and identified in the top right-hand corner of the page as '''"developer @ pangenomics"''', indicating your role and the account you are active in. You can then work as normal in that account. If you have not yet been given access to that role, you will receive an error message and will not be allowed to switch roles.

'''NOTE:''' When you switch roles, it may place you in a region you don't expect. Always verify the region you are in by looking at the top right of the web page - it will display your region there. Most of our resources live in "Oregon" (us-west-2), but some items appear in other regions on a case-by-case basis.

If you wish to switch context back to the 'gi-gateway' account in order to manage something, or to switch to another role in another account, simply:

* Click '''"developer @ pangenomics"''' in the top right corner of the window.
* Select '''"Back to bill@ucsc.edu"'''

You will then be sent back to the 'gi-gateway' context, where you can add another role to switch into, manage your credentials, or switch roles again.

== API Access and Secret Keys ==

If you require programmatic access to AWS, you will very likely be familiar with the AWS concept of Access Keys and Secret Keys, which scripts can use to authenticate to AWS and call its APIs without going through the web console. In the past, access keys and secret keys could be used with no further authentication. This introduces a security risk, as those keys must be carefully guarded - if anyone gets your keys, they can rack up charges on your AWS account without your knowledge!

Under the "Assume Role" mechanism we now use, Access Keys and Secret Keys can still be created by users, '''but only while logged into the gi-gateway account'''. Do not try to create keys while you have "Switched Roles" into another account.
Keys you create in the top level 'gi-gateway' account will work for you in any sub-account you have access to switch roles into. You will, however, need to do a little more configuration for your keys to work from a UNIX command line.

When using the '''"aws"''' command line tool (assuming you have it installed; the installation process is outside the scope of this document), use the steps outlined in this document to configure it:

https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html

That document contains a lot of other really useful information - if you plan on using keys for API access, we highly recommend reading it through.

At a minimum, you will need to configure the "aws" utility and then tweak the config a bit for our setup. To start, run "aws configure". It should look something like this:

<pre>
$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-west-2
Default output format [None]:
</pre>

Most folks do that to start. It creates two files:

<pre>
~/.aws/config
~/.aws/credentials
</pre>

Both files are needed to access AWS via the 'aws' command.

'''~/.aws/credentials'''

This file contains your access key and secret key, and should not need to be modified after running 'aws configure'. The same keys can be used to access any roles in any accounts you have access to.

'''~/.aws/config'''

This file contains account information you will need to tweak. You may want to configure something like this:

<pre>
[default]
region = us-west-2

[profile pangenomics-developer]
source_profile = default
role_arn = arn:aws:iam::422448306679:role/developer
mfa_serial = arn:aws:iam::652235167018:mfa/bill@ucsc.edu
duration_seconds = 43200
</pre>

The 'role_arn' line contains the role and account number you are accessing.
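Since ~/.aws/config is a standard INI file, you can also sanity-check a profile from a script - for example, pulling the account number and role name back out of 'role_arn'. A minimal sketch using the example values from this page:

```python
# Sketch: read an AWS-CLI-style config and extract the account number and
# role name from a profile's role_arn. Values mirror this page's examples.
import configparser
import io

SAMPLE = """
[default]
region = us-west-2

[profile pangenomics-developer]
source_profile = default
role_arn = arn:aws:iam::422448306679:role/developer
mfa_serial = arn:aws:iam::652235167018:mfa/bill@ucsc.edu
duration_seconds = 43200
"""

config = configparser.ConfigParser()
config.read_file(io.StringIO(SAMPLE))  # real file: config.read(os.path.expanduser("~/.aws/config"))

role_arn = config["profile pangenomics-developer"]["role_arn"]
account_id = role_arn.split(":")[4]    # ARN format: arn:aws:iam::ACCOUNT:role/NAME
role_name = role_arn.rsplit("/", 1)[1]
print(account_id, role_name)  # → 422448306679 developer
```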
You can see a list of live account numbers here: [[AWS Account List and Numbers]]. Find the account number you need and enter it on the role_arn line, along with the role name. You will get the role name from the cluster-admin group when you are granted access.

The 'mfa_serial' line contains the identifier for your MFA device. It will always look like '''"arn:aws:iam::652235167018:mfa/[your_iam_username]"'''. The account number listed there will always be '652235167018' because that is the account number of the top level 'gi-gateway' account.

The "duration_seconds" parameter sets the lifetime of your session token, here 43200 seconds (12 hours). That means you will only have to authenticate with MFA once every 12 hours, rather than on every command. 12 hours is the maximum you can request, although you can specify less.

Once that is configured, you should be able to reference the profile you just created when using the aws command, like so:

<pre>
$ aws s3 ls --profile pangenomics-developer
</pre>

It will ask you for your MFA code and then run the command. The token it creates will be valid for 12 hours if you specified "duration_seconds = 43200"; if you omitted that line, the default session duration is one hour. You can run other 'aws' CLI commands without re-authenticating with MFA for the duration of the session. After the session expires, you will need to authenticate via MFA again.

You can also extend your session length from one hour to twelve hours by utilizing the AWS Security Token Service (AWS STS). See this page for more information on how to do this:

https://docs.aws.amazon.com/cli/latest/reference/sts/assume-role.html#examples

The examples at the bottom are particularly useful.
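If you drive STS from a script, the assume-role call returns temporary credentials as JSON. A sketch of turning that response into shell export lines - the field names ('Credentials', 'AccessKeyId', 'SecretAccessKey', 'SessionToken', 'Expiration') match the STS response shape, but the values below are fabricated placeholders:

```python
# Sketch: convert an `aws sts assume-role` JSON response into environment
# variables a shell could export. The credential values here are fake
# placeholders for illustration only.
import json

# In practice this JSON would come from something like:
#   aws sts assume-role --role-arn arn:aws:iam::422448306679:role/developer \
#       --role-session-name mysession --serial-number <your mfa_serial> \
#       --token-code <6-digit code> --duration-seconds 43200
response = json.loads("""
{
  "Credentials": {
    "AccessKeyId": "ASIAEXAMPLEEXAMPLE",
    "SecretAccessKey": "examplesecretkeyexamplesecretkey",
    "SessionToken": "exampletoken",
    "Expiration": "2019-02-09T09:30:00Z"
  }
}
""")

creds = response["Credentials"]
for env_var, field in [("AWS_ACCESS_KEY_ID", "AccessKeyId"),
                       ("AWS_SECRET_ACCESS_KEY", "SecretAccessKey"),
                       ("AWS_SESSION_TOKEN", "SessionToken")]:
    print(f"export {env_var}={creds[field]}")
```

Evaluating those export lines in your shell lets subsequent 'aws' commands use the temporary role credentials until they expire.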
Once you enter the MFA code, the token it creates will be valid for 12 hours if you specified "duration_seconds = 43200", or if you omitted that line, the default session duration is one hour, so you can run other 'aws' cli commands without the need to re-authenticate with MFA for the duration of the session. After the session expires, you will need to authenticate via MFA again. 48e248560f22bcbc32aa1f44a63873d48eed0004 141 135 2019-04-26T20:13:37Z Weiler 3 wikitext text/x-wiki __TOC__ == Getting AWS (Amazon Web Services) Access == The Genomics Institute has a series of AWS Accounts that all support different projects. Often if you become associated with one or more of those projects, you will need access to that account or accounts. The way we are managing AWS IAM Account Access is that we have one AWS account that is the 'top level' account that everyone gets access to, and then, once you log in there, you can "Switch Role" into another sub-account that you are running things in. To get access, you will need your PI or Project Manager to email cluster-admin (cluster-admin@soe.ucsc.edu) asking for an AWS account for you, and also in that email to name the projects you will have access to. The cluster-admin group will contact you with your credentials to login. Once you login, you can change your password if you want to and also you will be able to set up MFA (Multi Factor Authentication) for your account. You will be required to use MFA in order to "Switch Role" into any of the sub-accounts for the projects you are working on. The login URL to use when logging in to the top level account is listed below. The top level account is known as "gi-gateway": https://gi-gateway.signin.aws.amazon.com/console When you login, you '''may''' see a couple error messages on the AWS dashboard saying you don't have access to view certain resources - '''this is normal''', so just ignore the error messages. 
== Configuring Account Credentials == Once you login to the gi-gateway, you will have very few permissions to do anything there - which is normal, since you will not be working in that account anyway. The gi-gateway account is just there to authenticate you to AWS. '''Changing Your Password''' You can change your password by clicking on your username on the top right of the web browser window, just to the right of the little bell. If your username is bill@ucsc.edu, for example: * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window. * Click the '''"My Security Credentials"''' drop-down menu option. * Click the '''"Change Password"''' button to change your password. Note that we have a password strength policy in place, so your password must conform to the following requirements: * Your password must be at least 10 characters long * Your password must contain at least one lowercase letter * Your password must contain at least one non-alphanumeric character * Your password must contain at least one number You will also need to configure MFA on your account before you will be allowed to switch roles into another account. '''Configuring MFA''' To configure MFA (Multi Factor Authentication), the most common way to do it is to use '''Google Authenticator''', which is an app available for Apple and Android based cell phones and mobile devices. The app is free, simply download it from the app store to your cell phone or tablet to get started. Other MFA apps may also work but we have not tested everything out there. Once you have Google Authenticator installed, log into the gi-gateway account using the above URL, then: * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window (again, bill@ucsc.edu is an example). * Click the '''"My Security Credentials"''' drop-down menu option. * Scroll down to the MFA (Multi-Factor Authentication) section of the page, and click '''"Assign MFA Device"'''. 
* In the following menu select '''"Virtual MFA Device"'''. * In the following window click the '''"Show QR Code"''' link, and the MFA QR barcode will appear on your screen. * Open the Google Authenticator app on your mobile device, and click the little "+" symbol in the top right corner of the app to add an account. * You will then need to select "Scan Barcode" in the Google Authenticaor app to continue, and aim your mobile device camera at the QR barcode. * The new account MFA device should then be set up and you should see a 6 digit number with a small timer to the right of it. You must type in one 6 digit code that it displays into your web browser when asked, then wait for the next code to appear after the timer expires, and type that into the second field. It should then inform you that you have successfully associated an MFA device with your account. Once you have associated an MFA device with the 'gi-gateway' Account, '''log out''', then log back in. It will ask for your username and password, and then ask for your MFA code, which you can view by opening Google Authenticator and seeing what code it is displaying at that time. The code changes every 30 seconds or so. '''You must log out first and log back in using MFA in order to be able to switch roles!!!''' == Switching Roles into Another AWS Account == Now that you have configured a password and enabled MFA, you will be allowed to "Switch Roles" into another account such that you can begin work there. The first time you switch roles into an account it will ask you a few questions, but subsequently it will remember which roles you have access to and they will become a menu item you can click on to quickly switch roles. Let's assume that you want to switch to the 'pangenomics' AWS account, and you have been already granted access to do so by the cluster-admin group. 
After logging into the 'gi-gateway' account at the URL listed here (same as above): https://gi-gateway.signin.aws.amazon.com/console Do the following to switch roles into the 'pangenomics' account (as an example): * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window (again, bill@ucsc.edu is an example). * Click the '''"Switch Role"''' option in the drop-down menu. * In the following menu it will ask you about the role you will be assuming. In our example we will use the following: -Account* = pangenomics -Role* = developer -Display Name = [leave blank, or use a short phrase] -Color = [choose a color for this role] * Then click the "Switch Role" button. If all went well you should be dumped into the 'pangenomics' account, and you should be identified in the top right hand corner of the page as '''"developer @ pangenomics"''', indicating your role and the account you are active in. You can then work as normal in that account. If you have not yet been given access to that role, you will receive an error message and not be allowed to switch roles. '''NOTE:''' When you switch roles, it may dump you into a region that you don't expect it to. Always verify the region you are in by looking at the top right of the web page - it will display your region there. Most of our stuff exists in "Oregon" (us-west-2), but some items appear in other regions on a per-case basis. If you wish to switch context back to the 'gi-gateway' account in order to manage something, or to switch to another role in another account, simple: * Click '''"developer @ pangenomics"''' in the top right corner of the window. * Select '''"Back to bill@ucsc.edu"''' You will then be sent back to the 'gi-gateway' context, and you can add another role to switch into, manage your credentials and further switch roles. 
== API Access and Secret Keys == If you require programmatic access to AWS, you will very likely be familiar with the AWS concept of Access Keys and Secret Keys, which can be used by scripts to authenticate yourself to AWS and use the APIs there without using the web console to authenticate. In the past, access keys and secret keys could be used by users with no further authentication. This introduces a security risk, as the management of those keys must be carefully guarded - if anyone gets your keys, they can rack up charges on your AWS account without your knowledge! Using the "Assume Role" mechanism we are now using, Access Keys and Secret Keys can still be created by users '''while logged into the gi-gateway account only'''. Do not try to create keys while you have "Switched Roles" into another account. Keys you create in the top level 'gi-gateway' account will work for you in any sub-account you have access to switch roles to. You will need to do a little more configuration for your keys to work from a UNIX command line however. When using the '''"awscli"''' command line tool, assuming you have it installed (the process of which is outside the scope of this document), you would use the steps outlined in this document to configure it: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html That document has a lot of other really useful information in it - if you plan on using keys for API access, we highly recommend reading it through. It should be noted that we recommend awscli version 1.16.x or later, as earlier versions have documented issues with using profiles and MFA related actions. You can determine your version of awscli by doing: aws --version Generically, if you plan on using keys for API Access, minimally you will need to configure the "aws" utility and then tweak the config a bit for our setup. To start, run "aws configure". 
It should look something like this: $ aws configure AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY Default region name [None]: us-west-2 Default output format [None]: Most folks do that to start. It creates two files: ~/.aws/config ~/.aws/credentials Those two files are important to access AWS via the 'aws' command. '''~/.aws/credentials''' This file contains your access key and secret key, and should not need to be modified after running 'aws configure'. Your same keys can be used to access any roles in any accounts you have access to. '''~/.aws/config''' This file contains some account information you will need to tweak. You may want to configure something like this: [default] region = us-west-2 [profile pangenomics-developer] source_profile = default role_arn = arn:aws:iam::422448306679:role/developer mfa_serial = arn:aws:iam::652235167018:mfa/bill@ucsc.edu duration_seconds = 43200 The "role_arn" line contains the role and account number you are accessing. You can see a list of live account numbers here: [[AWS Account List and Numbers]] Find the account number you need and enter it on the role_arn line, as well as the role name. You will get the role name from the cluster-admin group when you get access. The 'mfa_serial' line contains the identifier for your MFA device. It will always look like '''"arn:aws:iam::652235167018:mfa/[your_iam_username]"'''. The account number listed there will always be "652235167018" because that is the account number of the top level "gi-gateway" account. The "duration_seconds" parameter says that your session token will be 43200 seconds long (12 hours). That means you will only have to authenticate with MFA once every 12 hours. 12 hours is the maximum you can request, although you can specify less than that. This means it won't ask you for MFA every time you run a command for the next 12 hours. 
Once that is configured, you should be able to reference the profile you just created when using the aws command, like so: $ aws s3 ls --profile pangenomics-developer It will ask you for your MFA code and then run the command. Once you enter the MFA code, the token it creates will be valid for 12 hours if you specified "duration_seconds = 43200", or if you omitted that line, the default session duration is one hour, so you can run other 'aws' cli commands without the need to re-authenticate with MFA for the duration of the session. After the session expires, you will need to authenticate via MFA again. 79012bcb0fc944278e1c1bbea0f9cf6490475f5d 142 141 2019-06-26T22:39:00Z Weiler 3 wikitext text/x-wiki __TOC__ == Getting AWS (Amazon Web Services) Access == The Genomics Institute has a series of AWS Accounts that all support different projects. Often if you become associated with one or more of those projects, you will need access to that account or accounts. The way we are managing AWS IAM Account Access is that we have one AWS account that is the 'top level' account that everyone gets access to, and then, once you log in there, you can "Switch Role" into another sub-account that you are running things in. To get access, you will need your PI or Project Manager to email cluster-admin (cluster-admin@soe.ucsc.edu) asking for an AWS account for you, and also in that email to name the projects you will have access to. The cluster-admin group will contact you with your credentials to login. Once you login, you can change your password if you want to and also you will be able to set up MFA (Multi Factor Authentication) for your account. You will be required to use MFA in order to "Switch Role" into any of the sub-accounts for the projects you are working on. The login URL to use when logging in to the top level account is listed below. 
The top level account is known as "gi-gateway": https://gi-gateway.signin.aws.amazon.com/console When you login, you '''may''' see a couple error messages on the AWS dashboard saying you don't have access to view certain resources - '''this is normal''', so just ignore the error messages. == Configuring Account Credentials == Once you login to the gi-gateway, you will have very few permissions to do anything there - which is normal, since you will not be working in that account anyway. The gi-gateway account is just there to authenticate you to AWS. '''Changing Your Password''' You can change your password by clicking on your username on the top right of the web browser window, just to the right of the little bell. If your username is bill@ucsc.edu, for example: * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window. * Click the '''"My Security Credentials"''' drop-down menu option. * Click the '''"Change Password"''' button to change your password. Note that we have a password strength policy in place, so your password must conform to the following requirements: * Your password must be at least 10 characters long * Your password must contain at least one lowercase letter * Your password must contain at least one non-alphanumeric character * Your password must contain at least one number You will also need to configure MFA on your account before you will be allowed to switch roles into another account. '''Configuring MFA''' To configure MFA (Multi Factor Authentication), the most common way to do it is to use '''Google Authenticator''', which is an app available for Apple and Android based cell phones and mobile devices. The app is free, simply download it from the app store to your cell phone or tablet to get started. Other MFA apps may also work but we have not tested everything out there. 
Once you have Google Authenticator installed, log into the gi-gateway account using the above URL, then: * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window (again, bill@ucsc.edu is an example). * Click the '''"My Security Credentials"''' drop-down menu option. * Scroll down to the MFA (Multi-Factor Authentication) section of the page, and click '''"Assign MFA Device"'''. * In the following menu select '''"Virtual MFA Device"'''. * In the following window click the '''"Show QR Code"''' link, and the MFA QR barcode will appear on your screen. * Open the Google Authenticator app on your mobile device, and click the little "+" symbol in the top right corner of the app to add an account. * You will then need to select "Scan Barcode" in the Google Authenticaor app to continue, and aim your mobile device camera at the QR barcode. * The new account MFA device should then be set up and you should see a 6 digit number with a small timer to the right of it. You must type in one 6 digit code that it displays into your web browser when asked, then wait for the next code to appear after the timer expires, and type that into the second field. It should then inform you that you have successfully associated an MFA device with your account. Once you have associated an MFA device with the 'gi-gateway' Account, '''log out''', then log back in. It will ask for your username and password, and then ask for your MFA code, which you can view by opening Google Authenticator and seeing what code it is displaying at that time. The code changes every 30 seconds or so. '''You must log out first and log back in using MFA in order to be able to switch roles!!!''' == Switching Roles into Another AWS Account == Now that you have configured a password and enabled MFA, you will be allowed to "Switch Roles" into another account such that you can begin work there. 
The first time you switch roles into an account it will ask you a few questions, but subsequently it will remember which roles you have access to and they will become a menu item you can click on to quickly switch roles. Let's assume that you want to switch to the 'pangenomics' AWS account, and you have been already granted access to do so by the cluster-admin group. After logging into the 'gi-gateway' account at the URL listed here (same as above): https://gi-gateway.signin.aws.amazon.com/console Do the following to switch roles into the 'pangenomics' account (as an example): * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window (again, bill@ucsc.edu is an example). * Click the '''"Switch Role"''' option in the drop-down menu. * In the following menu it will ask you about the role you will be assuming. In our example we will use the following: -Account* = pangenomics -Role* = developer -Display Name = [leave blank, or use a short phrase] -Color = [choose a color for this role] * Then click the "Switch Role" button. If all went well you should be dumped into the 'pangenomics' account, and you should be identified in the top right hand corner of the page as '''"developer @ pangenomics"''', indicating your role and the account you are active in. You can then work as normal in that account. If you have not yet been given access to that role, you will receive an error message and not be allowed to switch roles. '''NOTE:''' When you switch roles, it may dump you into a region that you don't expect it to. Always verify the region you are in by looking at the top right of the web page - it will display your region there. Most of our stuff exists in "Oregon" (us-west-2), but some items appear in other regions on a per-case basis. 
If you wish to switch context back to the 'gi-gateway' account in order to manage something, or to switch to another role in another account, simple: * Click '''"developer @ pangenomics"''' in the top right corner of the window. * Select '''"Back to bill@ucsc.edu"''' You will then be sent back to the 'gi-gateway' context, and you can add another role to switch into, manage your credentials and further switch roles. == API Access and Secret Keys == If you require programmatic access to AWS, you will very likely be familiar with the AWS concept of Access Keys and Secret Keys, which can be used by scripts to authenticate yourself to AWS and use the APIs there without using the web console to authenticate. In the past, access keys and secret keys could be used by users with no further authentication. This introduces a security risk, as the management of those keys must be carefully guarded - if anyone gets your keys, they can rack up charges on your AWS account without your knowledge! Using the "Assume Role" mechanism we are now using, Access Keys and Secret Keys can still be created by users '''while logged into the gi-gateway account only'''. Do not try to create keys while you have "Switched Roles" into another account. Keys you create in the top level 'gi-gateway' account will work for you in any sub-account you have access to switch roles to. You will need to do a little more configuration for your keys to work from a UNIX command line however. When using the '''"awscli"''' command line tool, assuming you have it installed (the process of which is outside the scope of this document), you would use the steps outlined in this document to configure it: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html That document has a lot of other really useful information in it - if you plan on using keys for API access, we highly recommend reading it through. 
It should be noted that we recommend awscli version 1.16.187 or later, as earlier versions have documented issues with using profiles and MFA related actions. You can determine your version of awscli by doing: aws --version Generically, if you plan on using keys for API Access, minimally you will need to configure the "aws" utility and then tweak the config a bit for our setup. To start, run "aws configure". It should look something like this: $ aws configure AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY Default region name [None]: us-west-2 Default output format [None]: Most folks do that to start. It creates two files: ~/.aws/config ~/.aws/credentials Those two files are important to access AWS via the 'aws' command. '''~/.aws/credentials''' This file contains your access key and secret key, and should not need to be modified after running 'aws configure'. Your same keys can be used to access any roles in any accounts you have access to. '''~/.aws/config''' This file contains some account information you will need to tweak. You may want to configure something like this: [default] region = us-west-2 [profile pangenomics-developer] source_profile = default role_arn = arn:aws:iam::422448306679:role/developer mfa_serial = arn:aws:iam::652235167018:mfa/bill@ucsc.edu duration_seconds = 43200 The "role_arn" line contains the role and account number you are accessing. You can see a list of live account numbers here: [[AWS Account List and Numbers]] Find the account number you need and enter it on the role_arn line, as well as the role name. You will get the role name from the cluster-admin group when you get access. The 'mfa_serial' line contains the identifier for your MFA device. It will always look like '''"arn:aws:iam::652235167018:mfa/[your_iam_username]"'''. The account number listed there will always be "652235167018" because that is the account number of the top level "gi-gateway" account. 
__TOC__

== Getting AWS (Amazon Web Services) Access ==

The Genomics Institute maintains a number of AWS accounts, each supporting different projects. If you become associated with one or more of those projects, you will likely need access to the corresponding account or accounts.

We manage AWS IAM access through a single 'top level' account that everyone logs into; once logged in there, you "Switch Role" into the sub-account where you actually run things.

To get access, ask your PI or Project Manager to email cluster-admin (cluster-admin@soe.ucsc.edu) requesting an AWS account for you, naming in that email the projects you need access to. The cluster-admin group will then contact you with your login credentials.
Once you log in, you can change your password if you wish, and you can set up MFA (Multi-Factor Authentication) for your account. You will be required to use MFA in order to "Switch Role" into any of the sub-accounts for the projects you are working on.

The top level account is known as "gi-gateway"; the login URL is:

 https://gi-gateway.signin.aws.amazon.com/console

When you log in, you '''may''' see a couple of error messages on the AWS dashboard saying you don't have access to view certain resources. '''This is normal''', so just ignore them.

== Configuring Account Credentials ==

Once you log in to gi-gateway, you will have very few permissions to do anything there. This is normal, since you will not be working in that account anyway: the gi-gateway account exists only to authenticate you to AWS.

'''Changing Your Password'''

You can change your password by clicking your username at the top right of the browser window, just to the right of the little bell icon. If your username is bill@ucsc.edu, for example:

* Click '''"bill@ucsc.edu @ gi-gateway"''' at the top right of your browser window.
* Click the '''"My Security Credentials"''' drop-down menu option.
* Click the '''"Change Password"''' button.

Note that we have a password strength policy in place, so your password must conform to the following requirements:

* At least 10 characters long
* At least one lowercase letter
* At least one non-alphanumeric character
* At least one number

You will also need to configure MFA on your account before you will be allowed to switch roles into another account.
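If you want to sanity-check a candidate password against the policy above before submitting it, the rules translate directly into a few regex checks. This is an illustrative sketch; `meets_gateway_policy` is a hypothetical helper, not part of any AWS tooling:

```python
import re

def meets_gateway_policy(password):
    """Check a candidate password against the gi-gateway strength policy."""
    return (
        len(password) >= 10                                    # at least 10 characters
        and re.search(r"[a-z]", password) is not None          # a lowercase letter
        and re.search(r"[0-9]", password) is not None          # a number
        and re.search(r"[^a-zA-Z0-9]", password) is not None   # a non-alphanumeric character
    )

print(meets_gateway_policy("correct-horse-7"))  # → True
print(meets_gateway_policy("short1!"))          # → False (too short)
```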
'''Configuring MFA'''

The most common way to configure MFA (Multi-Factor Authentication) is with '''Google Authenticator''', a free app available for Apple and Android phones and tablets; simply download it from the app store to get started. Other MFA apps may also work, but we have not tested everything out there.

Once you have Google Authenticator installed, log in to the gi-gateway account using the URL above, then:

* Click '''"bill@ucsc.edu @ gi-gateway"''' at the top right of your browser window (again, bill@ucsc.edu is an example).
* Click the '''"My Security Credentials"''' drop-down menu option.
* Scroll down to the MFA (Multi-Factor Authentication) section of the page and click '''"Assign MFA Device"'''.
* In the following menu, select '''"Virtual MFA Device"'''.
* In the following window, click the '''"Show QR Code"''' link; the MFA QR barcode will appear on your screen.
* Open the Google Authenticator app on your mobile device and tap the "+" symbol in the top right corner of the app to add an account.
* Select "Scan Barcode" in the Google Authenticator app and aim your mobile device's camera at the QR barcode.
* The new MFA device should then be set up, and you should see a 6-digit number with a small timer to the right of it. Type the code it displays into your web browser when asked, wait for the next code to appear after the timer expires, and type that into the second field. It should then confirm that you have successfully associated an MFA device with your account.

Once you have associated an MFA device with the gi-gateway account, '''log out''', then log back in. You will be asked for your username and password, and then for your MFA code, which you can read from Google Authenticator at that moment. The code changes every 30 seconds.
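The 6-digit codes Google Authenticator displays are standard TOTP values (RFC 6238): an HMAC-SHA1 over the current 30-second timestep, keyed with the secret behind the QR code. A minimal stdlib-only sketch of the algorithm, for the curious (the secret below is the RFC test key, not a real authenticator secret):

```python
import hashlib
import hmac
import struct
import time

def totp(secret, at=None, step=30, digits=6):
    """RFC 6238 TOTP: HMAC-SHA1 over the current timestep, dynamically truncated."""
    counter = int((time.time() if at is None else at) // step)
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation per RFC 4226
    code = (struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF) % 10 ** digits
    return str(code).zfill(digits)

# RFC test vector key; a real secret comes from the QR code you scanned.
print(totp(b"12345678901234567890", at=59))  # → "287082"
```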
'''You must log out first and log back in using MFA in order to be able to switch roles!'''

== Switching Roles into Another AWS Account ==

Now that you have configured a password and enabled MFA, you will be allowed to "Switch Role" into another account and begin work there. The first time you switch roles into an account it will ask you a few questions; subsequently it remembers which roles you have access to, and they become menu items you can click to switch quickly.

Let's assume you want to switch to the 'pangenomics' AWS account, and the cluster-admin group has already granted you access. After logging in to the gi-gateway account at the URL listed here (same as above):

 https://gi-gateway.signin.aws.amazon.com/console

Do the following to switch roles into the 'pangenomics' account (as an example):

* Click '''"bill@ucsc.edu @ gi-gateway"''' at the top right of your browser window (again, bill@ucsc.edu is an example).
* Click the '''"Switch Role"''' option in the drop-down menu.
* In the following menu it will ask about the role you will be assuming. In our example:
** Account = pangenomics
** Role = developer
** Display Name = [leave blank, or use a short phrase]
** Color = [choose a color for this role]
* Then click the '''"Switch Role"''' button.

If all went well, you will land in the 'pangenomics' account and be identified in the top right corner of the page as '''"developer @ pangenomics"''', indicating your role and the account you are active in. You can then work as normal in that account. If you have not yet been given access to that role, you will receive an error message and the switch will be refused.

'''NOTE:''' When you switch roles, you may land in a region you don't expect. Always verify the region you are in by looking at the top right of the page, where it is displayed.
Most of our resources are in "Oregon" (us-west-2), but some appear in other regions on a case-by-case basis.

If you wish to switch context back to the gi-gateway account in order to manage something, or to switch to another role in another account, simply:

* Click '''"developer @ pangenomics"''' in the top right corner of the window.
* Select '''"Back to bill@ucsc.edu"'''.

You will be returned to the gi-gateway context, where you can add another role to switch into, manage your credentials, and switch roles again.

== API Access and Secret Keys ==

If you require programmatic access to AWS, you are probably familiar with Access Keys and Secret Keys, which scripts use to authenticate to the AWS APIs without going through the web console. In the past, access keys and secret keys worked with no further authentication. This is a security risk, so those keys must be carefully guarded: anyone who obtains your keys can rack up charges on your AWS account without your knowledge!

Under the "Assume Role" mechanism we now use, Access Keys and Secret Keys can still be created, '''but only while logged into the gi-gateway account'''. Do not try to create keys while you have "Switched Roles" into another account. Keys created in the top level gi-gateway account will work in any sub-account you are allowed to switch roles into. However, a little more configuration is needed before your keys will work from a UNIX command line.
To configure the '''"awscli"''' command line tool, assuming you have it installed (installation is outside the scope of this document), follow the steps outlined here:

 https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html

That document contains a lot of other useful information; if you plan on using keys for API access, we highly recommend reading it through. Note that we recommend awscli version 1.16.187 or later, as earlier versions have documented issues with profiles and MFA-related actions. You can check your version with:

 aws --version

At a minimum, you will need to configure the "aws" utility and then tweak the config a bit for our setup. To start, run "aws configure". It should look something like this:

 $ aws configure
 AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
 AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
 Default region name [None]: us-west-2
 Default output format [None]:

This creates two files, both needed to access AWS via the 'aws' command:

 ~/.aws/config
 ~/.aws/credentials

'''~/.aws/credentials'''

This file contains your access key and secret key, and should not need to be modified after running 'aws configure'. The same keys can be used with any role in any account you have access to.

'''~/.aws/config'''

This file contains account information you will need to tweak. You may want to configure something like this:

 [default]
 region = us-west-2
 [profile pangenomics-developer]
 source_profile = default
 role_arn = arn:aws:iam::422448306679:role/developer
 mfa_serial = arn:aws:iam::652235167018:mfa/bill@ucsc.edu
 duration_seconds = 43200

The "role_arn" line contains the role name and the account number you are accessing.
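The profile stanza above is ordinary INI syntax, so you can sanity-check a config file before trusting it to the CLI. A sketch using Python's stdlib configparser, with the example values from above (the account number and role here are the documented examples, not necessarily yours):

```python
import configparser
import re

# The example ~/.aws/config contents from above.
sample = """
[default]
region = us-west-2

[profile pangenomics-developer]
source_profile = default
role_arn = arn:aws:iam::422448306679:role/developer
mfa_serial = arn:aws:iam::652235167018:mfa/bill@ucsc.edu
duration_seconds = 43200
"""

config = configparser.ConfigParser()
config.read_string(sample)  # in real use: config.read(os.path.expanduser("~/.aws/config"))

profile = config["profile pangenomics-developer"]
# Pull the 12-digit account number and role name out of the role ARN.
account, role = re.match(r"arn:aws:iam::(\d{12}):role/(.+)", profile["role_arn"]).groups()
print(account, role)  # → 422448306679 developer
```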
You can see a list of live account numbers here: [[AWS Account List and Numbers]]

Find the account number you need and enter it on the role_arn line, along with the role name. You will get the role name from the cluster-admin group when you are granted access.

The 'mfa_serial' line contains the identifier for your MFA device. It will always look like '''"arn:aws:iam::652235167018:mfa/[your_iam_username]"'''. The account number there is always "652235167018" because that is the account number of the top-level "gi-gateway" account.

The "duration_seconds" parameter sets the lifetime of your session token; 43200 seconds (12 hours) is the maximum you can request, although you can specify less. With that setting, you only have to authenticate with MFA once every 12 hours rather than every time you run a command.

Once that is configured, you should be able to reference the profile you just created when using the aws command, like so:

 $ aws s3 ls --profile pangenomics-developer

It will ask you for your MFA code and then run the command. The token created when you enter the MFA code is valid for 12 hours if you specified "duration_seconds = 43200" (if you omitted that line, the default session duration is one hour), so you can run other 'aws' CLI commands without re-authenticating via MFA for the duration of the session. After the session expires, you will need to authenticate via MFA again.

== Tag Your Resources ==

When you start using AWS resources (instances, networks, etc.), it is very important that you "tag" your resources with the "Owner" tag (note the capital "O"). "Owner" is the key, and the value assigned to it is your IAM username (i.e. your email address).
So, for example, if I spin up an instance, I would tag it during or after creation with something like:

 Owner = bob@ucsc.edu

If you do not tag your instances, '''they will automatically be terminated within 10 minutes.''' Tag your instances especially, but tag every resource you create! This makes accounting tasks much easier and lets the Program Managers know which resources are controlled by whom.

AWS Account List and Numbers

This is a list of our currently available AWS accounts and their account numbers:

 ucsc-bd2k : 862902209576
 ucsc-toil-dev : 318423852362
 ucsc-vg-dev : 781907127277
 ucsc-platform-dev : 719818754276
 comparative-genomics-dev : 162786355865
 nanopore-dev : 270442831226
 ucsc-cgp-production : 097093801910
 platform-hca : 122796619775
 anvil-dev : 608666466534
 gi-gateway : 652235167018
 pangenomics : 422448306679
 braingeneers : 443872533066
 ucsctreehouse : 238605363322
 ucsc-bisti-dev : 851631505710

How to access the public servers

== How to Gain Access to the Public Genomics Institute Compute Servers ==

If you need access to the Genomics Institute compute servers, please ask your PI or sponsor to email 'cluster-admin@soe.ucsc.edu' requesting that you be granted access. Then we can set up a quick meeting to create your account and go over the details.

== Server Types and Management ==

You can log into our public compute servers via SSH:

 '''courtyard.gi.ucsc.edu''': 1TB RAM, 64 cores, 672GB local scratch space
 '''plaza.gi.ucsc.edu''': 1TB RAM, 64 cores, 672GB local scratch space

These servers run CentOS 7.5 Linux and are managed by the Genomics Institute Cluster Admin group. If you need software installed on them, please make your request by emailing cluster-admin@soe.ucsc.edu.

== Storage ==

These servers mount two types of storage: home directories and group storage directories.
Your home directory will be located at "/public/home/username" and has a 30GB quota.

The group storage directories are created per PI, and each group directory has a 15TB quota. For example, if David Haussler is the PI you report to directly, then the directory would exist as /public/groups/hausslerlab. Request access to that group directory and you will then be able to write to it. Each group directory is shared by the lab it belongs to, so you must be mindful of everyone's data usage and share the 15TB available per group accordingly.

On the compute servers you can check your group's current quota usage with the '/usr/bin/viewquota' command. You can only check the quota of a group you are part of (you would be a member of the UNIX group of the same name). If you wanted to check the quota usage of /public/groups/hausslerlab, for example, you would do:

 $ viewquota hausslerlab
 Project quota on /export (/dev/mapper/export)
 Project ID   Used  Soft  Hard  Warn/Grace
 ----------   ---------------------------------
 hausslerlab  1.8T   15T   16T  00 [------]

== Actually Doing Work and Computing ==

When doing research, running jobs, and the like, please be careful of your resource consumption on the server you are on. Don't run so many threads or processes at once that you overrun the available RAM or disk IO. If you are not sure of your potential RAM, CPU, or disk impact, start small with one or two processes and work your way up from there. Also, before running your jobs, use the 'top' command to see who and what else is running and what resources are already being consumed. If, after starting a process, you realize that the server slows down considerably or becomes unusable, kill your processes and re-evaluate what you need to make things work. These servers are shared resources - be a good neighbor!
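The advice above can be condensed into a quick pre-flight check using standard Linux tools before launching a large job:

```shell
# Quick look at the machine before launching work:
nproc    # number of CPU cores
free -h  # memory currently used and free
uptime   # 1/5/15-minute load averages (combine with 'top' for per-process detail)
```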
== Serving Files to the Public via the Web ==

If you want to set up a web page on courtyard, or serve files over HTTP from there, do this:

 mkdir /public/home/''your_username''/public_html
 chmod 755 /public/home/''your_username''/public_html

Put data in the public_html directory. The URL will be:

 http://public.gi.ucsc.edu/''~username''/

== /scratch Space on the Servers ==

Each server will generally have a local /scratch filesystem that you can use to store temporary files. '''BE ADVISED''' that /scratch is not backed up, and the data there could disappear in the event of a disk failure or anything else. Do not store important data there. If it is important, it should be moved somewhere else very soon after creation.

== Account Expiration ==

Your UNIX account will have an expiration date associated with it after creation. If your account was created in January, February, or March, then your account expiration date will be July 1st of the '''current year'''. If the account was created after March, then your expiration date will be July 1st of the '''following year'''. You will receive notice by email when your account is about to expire. To renew, simply ask the PI that sponsored you (who will be named in the notice) to email '''cluster-admin@soe.ucsc.edu''' requesting that your account be renewed for another year. If your account expires, the account will be suspended and you will no longer be able to log in or view any data you may have in our systems. Any automated scripts (owned by you) that run via cron or other mechanisms will cease to function.

Access to the Firewalled Compute Servers

Before you can access the firewalled environment (Prism), you must get VPN access to it, which is detailed here: [[Requirement for users to get GI VPN access]]

== Server Types and Management ==

After confirming your VPN software is working, you can ssh into one of the compute servers behind the VPN:

 '''crimson.prism''': 256GB RAM, 32 cores, 5.5TB local scratch space
 '''razzmatazz.prism''': 256GB RAM, 32 cores, 5.5TB local scratch space
 '''mustard.prism''': 1.5TB RAM, 160 cores, 9TB local scratch space

These servers run CentOS 7.5 Linux and are managed by the Genomics Institute Cluster Admin group. If you need software installed on any of these servers, please make your request by emailing cluster-admin@soe.ucsc.edu.

== Storage ==

These servers mount two types of storage: home directories and group storage directories. Your home directory will be located at "/private/home/username" and has a 30GB quota. The group storage directories are created per PI, and each group directory has a 15TB quota. For example, if David Haussler is the PI you report to directly, then the directory would exist as /private/groups/hausslerlab. Request access to that group directory and you will then be able to write to it. Each group directory is shared by the lab it belongs to, so you must be mindful of everyone's data usage and share the 15TB available per group accordingly.

On the compute servers you can check your group's current quota usage with the '/usr/bin/viewquota' command. You can only check the quota of a group you are part of (you would be a member of the UNIX group of the same name).
If you wanted to check the quota usage of /private/groups/hausslerlab, for example, you would do:

 $ viewquota hausslerlab
 Project quota on /export (/dev/mapper/export)
 Project ID   Used  Soft  Hard  Warn/Grace
 ----------   ---------------------------------
 hausslerlab  1.8T   15T   16T  00 [------]

== Actually Doing Work and Computing ==

When doing research, running jobs, and the like, please be careful of your resource consumption on the server you are on. Don't run so many threads or processes at once that you overrun the available RAM or disk IO. If you are not sure of your potential RAM, CPU, or disk impact, start small with one or two processes and work your way up from there. Also, before running your jobs, use the 'top' command to see who and what else is running and what resources are already being consumed. If, after starting a process, you realize that the server slows down considerably or becomes unusable, kill your processes and re-evaluate what you need to make things work. These servers are shared resources - be a good neighbor!

== The Firewall ==

All servers in this environment are behind a firewall, so you must connect to the VPN in order to access them. They will not be accessible from the greater Internet without the VPN, although you will be able to connect outbound from them to other servers on the Internet to copy data in, sync git repos, and the like. Only inbound connections are blocked. All machines behind the firewall have the private domain name suffix "*.prism".

== /scratch Space on the Servers ==

Each server will generally have a local /scratch filesystem that you can use to store temporary files. '''BE ADVISED''' that /scratch is not backed up, and the data there could disappear in the event of a disk failure or anything else. Do not store important data there.
If it is important, it should be moved somewhere else very soon after creation.

== Account Expiration ==

Your UNIX account will have an expiration date associated with it after creation. If your account was created in January, February, or March, then your account expiration date will be July 1st of the '''current year'''. If the account was created after March, then your expiration date will be July 1st of the '''following year'''. You will receive notice by email when your account is about to expire. To renew, simply ask the PI that sponsored you (who will be named in the notice) to email '''cluster-admin@soe.ucsc.edu''' requesting that your account be renewed for another year. If your account expires, the account will be suspended and you will no longer be able to log in or view any data you may have in our systems. Any automated scripts (owned by you) that run via cron or other mechanisms will cease to function.

Computational Genomics Kubernetes Installation

__TOC__

The Computational Genomics Group has a Kubernetes Cluster running on several large instances in AWS.
The current cluster makeup includes two worker nodes, each with the following specs: * 96 CPU cores (3.1 GHz) * 384 GB RAM * 3.3 TB Local NVMe Flash Storage * 25 Gb/s Network Interface ==Getting Authorized to Connect== If you require access to this kubernetes cluster, contact Benedict Paten asking for permission to use it, then pass on that permission via email to: cluster-admin@soe.ucsc.edu Let us know which group you are with and we can authorize you to use the cluster in the correct namespace. ==Authenticating to Kubernetes== We will authorize (authz) you to use the cluster on the server side, but you will also need to authenticate (authn) using your '@ucsc.edu' email address and a unique Java Web token. These credentials are installed in ~/.kube/config in whatever machine you are coming from to get to the cluster. To authenticate and get your base kubernetes configuration, go to this URL (below), which will ask you to authenticate to Google. Use your '@ucsc.edu' email address as the login. It will then ask you to authenticate via CruzID Gold if your web browser doesn't already have the authentication token cached: https://cg-kube-auth.gi.ucsc.edu Once you authenticate (via username/password and 2-factor auth for CruzID Gold)), it will pass you back to the 'https://cg-kube-auth.gi.ucsc.edu' website and it should confirm authentication on the top with a message saying "Successfully Authenticated". '''If you see any errors in red,''' but are sure you typed in your password and 2-factor auth correctly, click on the above link again (https://cg-kube-auth.gi.ucsc.edu) and authenticate a second time, which should work. There is a quirk where the web token doesn't always pass back to us correctly on the first try. Upon success, you will be able to click the blue "Download Config File" button, which contains your initial kubernetes config file. Copy this file to your home directory as ~/.kube/config. 
Follow the directions on the web page to insert your '''"namespace:"''' line as directed. We will let you know which namespace to use. ==Testing Connectivity== Once your ~/.kube/config file is set up correctly, you should be able to connect to the cluster. All our shared servers here at the Genomics Institute have the 'kubectl' command installed on them, but if you are coming from somewhere else, just make sure the "kubectl" utility is installed on that machine. A quick test should go as follows: $ kubectl get nodes NAME STATUS ROLES AGE VERSION k1.kube Ready <none> 13h v1.15.3 k2.kube Ready <none> 13h v1.15.3 master.kube Ready master 13h v1.15.3 ==View the Cluster's Current Activity== You can take a look at current resource consumption by taking a look at our Ganglia Cluster monitor tool: https://ganglia.gi.ucsc.edu/ That website requires a username and password: username: genecats password: KiloKluster That's mostly for keeping the scrip kiddies and bots from banging on it. Once you get in, you should see a drop-down menu near the top left of the screen near "Genomics Institute Grid". From the drop-down menu, select "CG Kubernetes Cluster". It will take you to a page detailing the current resource usage and activity on the nodes. This can be useful for see if anyone else is using the whole cluster, or just to get an idea of how many resources are available for your batch of jobs to assign to the cluster. 86f07d928a5f4753b8523af8e78a9a715032ae11 150 149 2019-09-04T22:28:15Z Weiler 3 wikitext text/x-wiki __TOC__ The Computational Genomics Group has a Kubernetes Cluster running on several large instances in AWS. 
The current cluster makeup includes two worker nodes, each with the following specs: * 96 CPU cores (3.1 GHz) * 384 GB RAM * 3.3 TB Local NVMe Flash Storage * 25 Gb/s Network Interface ==Getting Authorized to Connect== If you require access to this kubernetes cluster, contact Benedict Paten asking for permission to use it, then pass on that permission via email to: cluster-admin@soe.ucsc.edu Let us know which group you are with and we can authorize you to use the cluster in the correct namespace. ==Authenticating to Kubernetes== We will authorize (authz) you to use the cluster on the server side, but you will also need to authenticate (authn) using your '@ucsc.edu' email address and a unique Java Web token. These credentials are installed in ~/.kube/config in whatever machine you are coming from to get to the cluster. To authenticate and get your base kubernetes configuration, go to this URL (below), which will ask you to authenticate to Google. Use your '@ucsc.edu' email address as the login. It will then ask you to authenticate via CruzID Gold if your web browser doesn't already have the authentication token cached: https://cg-kube-auth.gi.ucsc.edu Once you authenticate (via username/password and 2-factor auth for CruzID Gold)), it will pass you back to the 'https://cg-kube-auth.gi.ucsc.edu' website and it should confirm authentication on the top with a message saying "Successfully Authenticated". '''If you see any errors in red,''' but are sure you typed in your password and 2-factor auth correctly, click on the above link again (https://cg-kube-auth.gi.ucsc.edu) and authenticate a second time, which should work. There is a quirk where the web token doesn't always pass back to us correctly on the first try. Upon success, you will be able to click the blue "Download Config File" button, which contains your initial kubernetes config file. Copy this file to your home directory as ~/.kube/config. 
Follow the directions on the web page to insert your '''"namespace:"''' line as directed. We will let you know which namespace to use. ==Testing Connectivity== Once your ~/.kube/config file is set up correctly, you should be able to connect to the cluster. All our shared servers here at the Genomics Institute have the 'kubectl' command installed on them, but if you are coming from somewhere else, just make sure the "kubectl" utility is installed on that machine. A quick test should go as follows: $ kubectl get nodes NAME STATUS ROLES AGE VERSION k1.kube Ready <none> 13h v1.15.3 k2.kube Ready <none> 13h v1.15.3 master.kube Ready master 13h v1.15.3 ==Running Pods and Jobs== When running jobs and pods on kubernetes, you will always want to specify "requests" and "limits" on resources, otherwise you pods will get stuck with the default limits which are tiny (to protect against runaway pods). Here is a good example of a job file that specifies limits: job.yml apiVersion: batch/v1 kind: Job metadata: name: $USER-$TS spec: backoffLimit: 0 ttlSecondsAfterFinished: 30 template: spec: containers: - name: magic image: robcurrie/ubuntu imagePullPolicy: Always resources: requests: cpu: "30" memory: "30G" limits: cpu: "31" memory: "40G" command: ["/bin/bash", "-c"] args: ['for i in {1..100}; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done'] restartPolicy: Never ==View the Cluster's Current Activity== You can take a look at current resource consumption by taking a look at our Ganglia Cluster monitor tool: https://ganglia.gi.ucsc.edu/ That website requires a username and password: username: genecats password: KiloKluster That's mostly for keeping the scrip kiddies and bots from banging on it. Once you get in, you should see a drop-down menu near the top left of the screen near "Genomics Institute Grid". From the drop-down menu, select "CG Kubernetes Cluster". It will take you to a page detailing the current resource usage and activity on the nodes. 
This can be useful for see if anyone else is using the whole cluster, or just to get an idea of how many resources are available for your batch of jobs to assign to the cluster. c80c71a5be74e754a7d14c275acebc45f6d46973 151 150 2019-09-04T22:34:52Z Weiler 3 /* Running Pods and Jobs */ wikitext text/x-wiki __TOC__ The Computational Genomics Group has a Kubernetes Cluster running on several large instances in AWS. The current cluster makeup includes two worker nodes, each with the following specs: * 96 CPU cores (3.1 GHz) * 384 GB RAM * 3.3 TB Local NVMe Flash Storage * 25 Gb/s Network Interface ==Getting Authorized to Connect== If you require access to this kubernetes cluster, contact Benedict Paten asking for permission to use it, then pass on that permission via email to: cluster-admin@soe.ucsc.edu Let us know which group you are with and we can authorize you to use the cluster in the correct namespace. ==Authenticating to Kubernetes== We will authorize (authz) you to use the cluster on the server side, but you will also need to authenticate (authn) using your '@ucsc.edu' email address and a unique Java Web token. These credentials are installed in ~/.kube/config in whatever machine you are coming from to get to the cluster. To authenticate and get your base kubernetes configuration, go to this URL (below), which will ask you to authenticate to Google. Use your '@ucsc.edu' email address as the login. It will then ask you to authenticate via CruzID Gold if your web browser doesn't already have the authentication token cached: https://cg-kube-auth.gi.ucsc.edu Once you authenticate (via username/password and 2-factor auth for CruzID Gold)), it will pass you back to the 'https://cg-kube-auth.gi.ucsc.edu' website and it should confirm authentication on the top with a message saying "Successfully Authenticated". 
'''If you see any errors in red,''' but are sure you typed in your password and 2-factor auth correctly, click on the above link again (https://cg-kube-auth.gi.ucsc.edu) and authenticate a second time, which should work. There is a quirk where the web token doesn't always pass back to us correctly on the first try. Upon success, you will be able to click the blue "Download Config File" button, which contains your initial kubernetes config file. Copy this file to your home directory as ~/.kube/config. Follow the directions on the web page to insert your '''"namespace:"''' line as directed. We will let you know which namespace to use. ==Testing Connectivity== Once your ~/.kube/config file is set up correctly, you should be able to connect to the cluster. All our shared servers here at the Genomics Institute have the 'kubectl' command installed on them, but if you are coming from somewhere else, just make sure the "kubectl" utility is installed on that machine. A quick test should go as follows: $ kubectl get nodes NAME STATUS ROLES AGE VERSION k1.kube Ready <none> 13h v1.15.3 k2.kube Ready <none> 13h v1.15.3 master.kube Ready master 13h v1.15.3 ==Running Pods and Jobs== When running jobs and pods on kubernetes, you will always want to specify "requests" and "limits" on resources, otherwise you pods will get stuck with the default limits which are tiny (to protect against runaway pods). 
Here is a good example of a job file that specifies limits: job.yml apiVersion: batch/v1 kind: Job metadata: name: $USER-$TS spec: backoffLimit: 0 ttlSecondsAfterFinished: 30 template: spec: containers: - name: magic image: robcurrie/ubuntu imagePullPolicy: Always resources: requests: cpu: "1" memory: "2G" ephemeral-storage: "2G" limits: cpu: "2" memory: "3G" ephemeral-storage: "3G" command: ["/bin/bash", "-c"] args: ['for i in {1..100}; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done'] restartPolicy: Never ==View the Cluster's Current Activity== You can take a look at current resource consumption by taking a look at our Ganglia Cluster monitor tool: https://ganglia.gi.ucsc.edu/ That website requires a username and password: username: genecats password: KiloKluster That's mostly for keeping the scrip kiddies and bots from banging on it. Once you get in, you should see a drop-down menu near the top left of the screen near "Genomics Institute Grid". From the drop-down menu, select "CG Kubernetes Cluster". It will take you to a page detailing the current resource usage and activity on the nodes. This can be useful for see if anyone else is using the whole cluster, or just to get an idea of how many resources are available for your batch of jobs to assign to the cluster. 5d12ec043957203462f7c08831070a9c8b2239e1 Computational Genomics Kubernetes Installation 0 23 152 151 2019-09-04T22:36:45Z Weiler 3 /* Running Pods and Jobs */ wikitext text/x-wiki __TOC__ The Computational Genomics Group has a Kubernetes Cluster running on several large instances in AWS. 
The current cluster makeup includes two worker nodes, each with the following specs: * 96 CPU cores (3.1 GHz) * 384 GB RAM * 3.3 TB Local NVMe Flash Storage * 25 Gb/s Network Interface ==Getting Authorized to Connect== If you require access to this kubernetes cluster, contact Benedict Paten asking for permission to use it, then pass on that permission via email to: cluster-admin@soe.ucsc.edu Let us know which group you are with and we can authorize you to use the cluster in the correct namespace. ==Authenticating to Kubernetes== We will authorize (authz) you to use the cluster on the server side, but you will also need to authenticate (authn) using your '@ucsc.edu' email address and a unique Java Web token. These credentials are installed in ~/.kube/config in whatever machine you are coming from to get to the cluster. To authenticate and get your base kubernetes configuration, go to this URL (below), which will ask you to authenticate to Google. Use your '@ucsc.edu' email address as the login. It will then ask you to authenticate via CruzID Gold if your web browser doesn't already have the authentication token cached: https://cg-kube-auth.gi.ucsc.edu Once you authenticate (via username/password and 2-factor auth for CruzID Gold)), it will pass you back to the 'https://cg-kube-auth.gi.ucsc.edu' website and it should confirm authentication on the top with a message saying "Successfully Authenticated". '''If you see any errors in red,''' but are sure you typed in your password and 2-factor auth correctly, click on the above link again (https://cg-kube-auth.gi.ucsc.edu) and authenticate a second time, which should work. There is a quirk where the web token doesn't always pass back to us correctly on the first try. Upon success, you will be able to click the blue "Download Config File" button, which contains your initial kubernetes config file. Copy this file to your home directory as ~/.kube/config. 
Follow the directions on the web page to insert your '''"namespace:"''' line as directed. We will let you know which namespace to use. ==Testing Connectivity== Once your ~/.kube/config file is set up correctly, you should be able to connect to the cluster. All our shared servers here at the Genomics Institute have the 'kubectl' command installed on them, but if you are coming from somewhere else, just make sure the "kubectl" utility is installed on that machine. A quick test should go as follows: $ kubectl get nodes NAME STATUS ROLES AGE VERSION k1.kube Ready <none> 13h v1.15.3 k2.kube Ready <none> 13h v1.15.3 master.kube Ready master 13h v1.15.3 ==Running Pods and Jobs with Requests and Limits== When running jobs and pods on kubernetes, you will always want to specify "requests" and "limits" on resources, otherwise you pods will get stuck with the default limits which are tiny (to protect against runaway pods). You should always have an idea of how much resources will be consumed by your jobs, and not use much more than that, in order not to hog all the resources. It also prevents your jib from "running away" unexpectedly and chewing up more resources than expected. 
Here is a good example of a job file that specifies limits: job.yml apiVersion: batch/v1 kind: Job metadata: name: $USER-$TS spec: backoffLimit: 0 ttlSecondsAfterFinished: 30 template: spec: containers: - name: magic image: robcurrie/ubuntu imagePullPolicy: Always resources: requests: cpu: "1" memory: "2G" ephemeral-storage: "2G" limits: cpu: "2" memory: "3G" ephemeral-storage: "3G" command: ["/bin/bash", "-c"] args: ['for i in {1..100}; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done'] restartPolicy: Never ==View the Cluster's Current Activity== You can take a look at current resource consumption by taking a look at our Ganglia Cluster monitor tool: https://ganglia.gi.ucsc.edu/ That website requires a username and password: username: genecats password: KiloKluster That's mostly for keeping the scrip kiddies and bots from banging on it. Once you get in, you should see a drop-down menu near the top left of the screen near "Genomics Institute Grid". From the drop-down menu, select "CG Kubernetes Cluster". It will take you to a page detailing the current resource usage and activity on the nodes. This can be useful for see if anyone else is using the whole cluster, or just to get an idea of how many resources are available for your batch of jobs to assign to the cluster. 617de74c11f1c305b5a24e325495ac1e0b485d23 153 152 2019-09-06T19:38:16Z Weiler 3 wikitext text/x-wiki __TOC__ The Computational Genomics Group has a Kubernetes Cluster running on several large instances in AWS. The current cluster makeup includes two worker nodes, each with the following specs: * 96 CPU cores (3.1 GHz) * 384 GB RAM * 3.3 TB Local NVMe Flash Storage * 25 Gb/s Network Interface ==Getting Authorized to Connect== If you require access to this kubernetes cluster, contact Benedict Paten asking for permission to use it, then pass on that permission via email to: cluster-admin@soe.ucsc.edu Let us know which group you are with and we can authorize you to use the cluster in the correct namespace. 
==Authenticating to Kubernetes== We will authorize (authz) you to use the cluster on the server side, but you will also need to authenticate (authn) using your '@ucsc.edu' email address and a unique Java Web token. These credentials are installed in ~/.kube/config in whatever machine you are coming from to get to the cluster. To authenticate and get your base kubernetes configuration, go to this URL (below), which will ask you to authenticate to Google. Use your '@ucsc.edu' email address as the login. It will then ask you to authenticate via CruzID Gold if your web browser doesn't already have the authentication token cached: https://cg-kube-auth.gi.ucsc.edu Once you authenticate (via username/password and 2-factor auth for CruzID Gold)), it will pass you back to the 'https://cg-kube-auth.gi.ucsc.edu' website and it should confirm authentication on the top with a message saying "Successfully Authenticated". '''If you see any errors in red,''' but are sure you typed in your password and 2-factor auth correctly, click on the above link again (https://cg-kube-auth.gi.ucsc.edu) and authenticate a second time, which should work. There is a quirk where the web token doesn't always pass back to us correctly on the first try. Upon success, you will be able to click the blue "Download Config File" button, which contains your initial kubernetes config file. Copy this file to your home directory as ~/.kube/config. Follow the directions on the web page to insert your '''"namespace:"''' line as directed. We will let you know which namespace to use. ==Testing Connectivity== Once your ~/.kube/config file is set up correctly, you should be able to connect to the cluster. All our shared servers here at the Genomics Institute have the 'kubectl' command installed on them, but if you are coming from somewhere else, just make sure the "kubectl" utility is installed on that machine. 
A quick test should go as follows: $ kubectl get nodes NAME STATUS ROLES AGE VERSION k1.kube Ready <none> 13h v1.15.3 k2.kube Ready <none> 13h v1.15.3 master.kube Ready master 13h v1.15.3 ==Running Pods and Jobs with Requests and Limits== When running jobs and pods on kubernetes, you will always want to specify "requests" and "limits" on resources, otherwise you pods will get stuck with the default limits which are tiny (to protect against runaway pods). You should always have an idea of how much resources will be consumed by your jobs, and not use much more than that, in order not to hog all the resources. It also prevents your jib from "running away" unexpectedly and chewing up more resources than expected. Here is a good example of a job file that specifies limits: job.yml apiVersion: batch/v1 kind: Job metadata: name: $USER-$TS spec: backoffLimit: 0 ttlSecondsAfterFinished: 30 template: spec: containers: - name: magic image: robcurrie/ubuntu imagePullPolicy: Always resources: requests: cpu: "1" memory: "2G" ephemeral-storage: "2G" limits: cpu: "2" memory: "3G" ephemeral-storage: "3G" command: ["/bin/bash", "-c"] args: ['for i in {1..100}; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done'] restartPolicy: Never '''NOTE:''' Jobs and pods that have '''completed''' over 48 hours ago but have not been cleaned up will be automatically removed by the garbage collector. Most jobs will have the "ttlSecondsAfterFinished" configuration item in them, so they will automatically cleaned up after that time expires, but leaving the old pods and jobs around pins the disk space they were using while they remain, so it's good to get rid of them as soon as they are done unless you are debugging a failure or something like that. Jobs that run over 48 hours will not be deleted, only the ones that have '''exited''' over 48 hours ago. 
A lot of other good information can be viewed on Rob Currie's github page, which includes examples and some "How To" documentation: https://github.com/rcurrie/kubernetes ==View the Cluster's Current Activity== You can take a look at current resource consumption by taking a look at our Ganglia Cluster monitor tool: https://ganglia.gi.ucsc.edu/ That website requires a username and password: username: genecats password: KiloKluster That's mostly for keeping the scrip kiddies and bots from banging on it. Once you get in, you should see a drop-down menu near the top left of the screen near "Genomics Institute Grid". From the drop-down menu, select "CG Kubernetes Cluster". It will take you to a page detailing the current resource usage and activity on the nodes. This can be useful for see if anyone else is using the whole cluster, or just to get an idea of how many resources are available for your batch of jobs to assign to the cluster. b10c3696104743ef494cc61710d675a259166002 154 153 2019-09-06T19:39:03Z Weiler 3 /* Running Pods and Jobs with Requests and Limits */ wikitext text/x-wiki __TOC__ The Computational Genomics Group has a Kubernetes Cluster running on several large instances in AWS. The current cluster makeup includes two worker nodes, each with the following specs: * 96 CPU cores (3.1 GHz) * 384 GB RAM * 3.3 TB Local NVMe Flash Storage * 25 Gb/s Network Interface ==Getting Authorized to Connect== If you require access to this kubernetes cluster, contact Benedict Paten asking for permission to use it, then pass on that permission via email to: cluster-admin@soe.ucsc.edu Let us know which group you are with and we can authorize you to use the cluster in the correct namespace. ==Authenticating to Kubernetes== We will authorize (authz) you to use the cluster on the server side, but you will also need to authenticate (authn) using your '@ucsc.edu' email address and a unique Java Web token. 
These credentials are installed in ~/.kube/config in whatever machine you are coming from to get to the cluster. To authenticate and get your base kubernetes configuration, go to this URL (below), which will ask you to authenticate to Google. Use your '@ucsc.edu' email address as the login. It will then ask you to authenticate via CruzID Gold if your web browser doesn't already have the authentication token cached: https://cg-kube-auth.gi.ucsc.edu Once you authenticate (via username/password and 2-factor auth for CruzID Gold)), it will pass you back to the 'https://cg-kube-auth.gi.ucsc.edu' website and it should confirm authentication on the top with a message saying "Successfully Authenticated". '''If you see any errors in red,''' but are sure you typed in your password and 2-factor auth correctly, click on the above link again (https://cg-kube-auth.gi.ucsc.edu) and authenticate a second time, which should work. There is a quirk where the web token doesn't always pass back to us correctly on the first try. Upon success, you will be able to click the blue "Download Config File" button, which contains your initial kubernetes config file. Copy this file to your home directory as ~/.kube/config. Follow the directions on the web page to insert your '''"namespace:"''' line as directed. We will let you know which namespace to use. ==Testing Connectivity== Once your ~/.kube/config file is set up correctly, you should be able to connect to the cluster. All our shared servers here at the Genomics Institute have the 'kubectl' command installed on them, but if you are coming from somewhere else, just make sure the "kubectl" utility is installed on that machine. 
A quick test should go as follows: $ kubectl get nodes NAME STATUS ROLES AGE VERSION k1.kube Ready <none> 13h v1.15.3 k2.kube Ready <none> 13h v1.15.3 master.kube Ready master 13h v1.15.3 ==Running Pods and Jobs with Requests and Limits== When running jobs and pods on kubernetes, you will always want to specify "requests" and "limits" on resources, otherwise you pods will get stuck with the default limits which are tiny (to protect against runaway pods). You should always have an idea of how much resources will be consumed by your jobs, and not use much more than that, in order not to hog all the resources. It also prevents your jib from "running away" unexpectedly and chewing up more resources than expected. Here is a good example of a job file that specifies limits: job.yml apiVersion: batch/v1 kind: Job metadata: name: $USER-$TS spec: backoffLimit: 0 ttlSecondsAfterFinished: 30 template: spec: containers: - name: magic image: robcurrie/ubuntu imagePullPolicy: Always resources: requests: cpu: "1" memory: "2G" ephemeral-storage: "2G" limits: cpu: "2" memory: "3G" ephemeral-storage: "3G" command: ["/bin/bash", "-c"] args: ['for i in {1..100}; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done'] restartPolicy: Never '''NOTE:''' Jobs and pods that have '''completed''' over 48 hours ago but have not been cleaned up will be automatically removed by the garbage collector. Most jobs will have the "ttlSecondsAfterFinished" configuration item in them, so they will automatically cleaned up after that time expires, but leaving the old pods and jobs around pins the disk space they were using while they remain, so it's good to get rid of them as soon as they are done unless you are debugging a failure or something like that. Jobs that '''run''' over 48 hours will not be deleted, only the ones that have '''exited''' over 48 hours ago. 
A lot of other good information can be viewed on Rob Currie's github page, which includes examples and some "How To" documentation: https://github.com/rcurrie/kubernetes ==View the Cluster's Current Activity== You can take a look at current resource consumption by taking a look at our Ganglia Cluster monitor tool: https://ganglia.gi.ucsc.edu/ That website requires a username and password: username: genecats password: KiloKluster That's mostly for keeping the scrip kiddies and bots from banging on it. Once you get in, you should see a drop-down menu near the top left of the screen near "Genomics Institute Grid". From the drop-down menu, select "CG Kubernetes Cluster". It will take you to a page detailing the current resource usage and activity on the nodes. This can be useful for see if anyone else is using the whole cluster, or just to get an idea of how many resources are available for your batch of jobs to assign to the cluster. db25afc00f8eaa28404be30737ddb4ee47050a02 155 154 2019-09-11T23:16:47Z Weiler 3 /* Authenticating to Kubernetes */ wikitext text/x-wiki __TOC__ The Computational Genomics Group has a Kubernetes Cluster running on several large instances in AWS. The current cluster makeup includes two worker nodes, each with the following specs: * 96 CPU cores (3.1 GHz) * 384 GB RAM * 3.3 TB Local NVMe Flash Storage * 25 Gb/s Network Interface ==Getting Authorized to Connect== If you require access to this kubernetes cluster, contact Benedict Paten asking for permission to use it, then pass on that permission via email to: cluster-admin@soe.ucsc.edu Let us know which group you are with and we can authorize you to use the cluster in the correct namespace. ==Authenticating to Kubernetes== We will authorize (authz) you to use the cluster on the server side, but you will also need to authenticate (authn) using your '@ucsc.edu' email address and a unique Java Web token. 
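The credential the auth site issues is a standard JSON Web Token. As a quick illustration of what that means, you can decode one yourself; the token below is a harmless made-up example, and real tokens are base64url-encoded, so you may need to translate '-'/'_' to '+'/'/' and restore '=' padding before base64 -d will accept a segment:

```shell
# A JWT is three base64url segments joined by dots: header.payload.signature.
# Peek at the claims in the payload segment. (Made-up example token below --
# never paste a real cluster credential into a terminal you don't trust.)
token='eyJhbGciOiJSUzI1NiJ9.eyJlbWFpbCI6InlvdUB1Y3NjLmVkdSJ9.c2ln'
printf '%s' "$token" | cut -d. -f2 | base64 -d; echo
# prints: {"email":"you@ucsc.edu"}
```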
These credentials are installed in ~/.kube/config on whatever machine you are connecting from.

To authenticate and get your base Kubernetes configuration, go to the URL below, which will ask you to authenticate to Google. Use your '@ucsc.edu' email address as the login. It will then ask you to authenticate via CruzID Gold if your web browser doesn't already have the authentication token cached:

https://cg-kube-auth.gi.ucsc.edu

Once you authenticate (via username/password and 2-factor auth for CruzID Gold), it will pass you back to the 'https://cg-kube-auth.gi.ucsc.edu' website, which should confirm authentication at the top with the message "Successfully Authenticated".

'''If you see any errors in red,''' but are sure you typed your password and 2-factor auth correctly, click the link again (https://cg-kube-auth.gi.ucsc.edu) and authenticate a second time, which should work. There is a quirk where the web token doesn't always pass back to us correctly on the first try.

Upon success, you will be able to click the blue "Download Config File" button to get your initial Kubernetes config file. Copy this file to your home directory as ~/.kube/config. Follow the directions on the web page to insert your '''"namespace:"''' line as directed. We will let you know which namespace to use.

==Testing Connectivity==

Once your ~/.kube/config file is set up correctly, you should be able to connect to the cluster. All our shared servers here at the Genomics Institute have the 'kubectl' command installed, but if you are coming from somewhere else, make sure the "kubectl" utility is installed on that machine.
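Before running the test below, it's worth double-checking that the "namespace:" line from the previous section actually made it into your config. The relevant part of ~/.kube/config looks roughly like this; all names here are illustrative, and your context, cluster, and user entries will differ:

```yaml
# ~/.kube/config (excerpt) -- illustrative names only
contexts:
- context:
    cluster: cg-kube                    # whatever the downloaded file names it
    user: yourname@ucsc.edu
    namespace: your-assigned-namespace  # <-- the line you insert by hand
  name: cg-kube
```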
A quick test should go as follows:

 $ kubectl get nodes
 NAME          STATUS   ROLES    AGE   VERSION
 k1.kube       Ready    <none>   13h   v1.15.3
 k2.kube       Ready    <none>   13h   v1.15.3
 master.kube   Ready    master   13h   v1.15.3

==Running Pods and Jobs with Requests and Limits==

When running jobs and pods on Kubernetes, you will always want to specify "requests" and "limits" on resources; otherwise your pods will be stuck with the default limits, which are tiny (to protect against runaway pods). You should always have an idea of how many resources your jobs will consume, and not request much more than that, so you don't hog the cluster. Setting limits also prevents your job from "running away" unexpectedly and chewing up more resources than expected.

Here is a good example of a job file that specifies requests and limits:

job.yml:

 apiVersion: batch/v1
 kind: Job
 metadata:
   name: $USER-$TS
 spec:
   backoffLimit: 0
   ttlSecondsAfterFinished: 30
   template:
     spec:
       containers:
       - name: magic
         image: robcurrie/ubuntu
         imagePullPolicy: Always
         resources:
           requests:
             cpu: "1"
             memory: "2G"
             ephemeral-storage: "2G"
           limits:
             cpu: "2"
             memory: "3G"
             ephemeral-storage: "3G"
         command: ["/bin/bash", "-c"]
         args: ['for i in {1..100}; do echo "$i: $(date)"; sleep 1; done']
       restartPolicy: Never

'''NOTE:''' Jobs and pods that '''completed''' over 48 hours ago but have not been cleaned up will be automatically removed by the garbage collector. Most jobs will have the "ttlSecondsAfterFinished" configuration item in them, so they will be cleaned up automatically after that time expires, but leaving old pods and jobs around pins the disk space they were using for as long as they remain, so it's good to get rid of them as soon as they are done unless you are debugging a failure. Jobs that '''run''' over 48 hours will not be deleted; only those that '''exited''' over 48 hours ago.
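The manifest names the job $USER-$TS, so you need to substitute your username and a unique suffix before handing it to kubectl. One way to do that is sketched below; the one-line stand-in manifest and the sed substitution are just one approach for illustration, not part of the wiki's tooling:

```shell
# Fill in the $USER-$TS placeholders in the job manifest, then submit it.
# A tiny stand-in manifest is written here so the rendering step can be run
# end to end; with the real job.yml the sed call is identical.
TS=$(date +%s)                    # unique, sortable suffix for the job name
printf 'metadata:\n  name: $USER-$TS\n' > job.yml
sed -e "s/\$USER/$USER/g" -e "s/\$TS/$TS/g" job.yml > job-rendered.yml
cat job-rendered.yml              # name: <you>-<timestamp>
# Submit, then clean up as soon as the job is done:
#   kubectl apply -f job-rendered.yml
#   kubectl delete job "$USER-$TS"
```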
A lot of other good information can be viewed on Rob Currie's github page, which includes examples and some "How To" documentation: https://github.com/rcurrie/kubernetes ==View the Cluster's Current Activity== You can take a look at current resource consumption by taking a look at our Ganglia Cluster monitor tool: https://ganglia.gi.ucsc.edu/ That website requires a username and password: username: genecats password: KiloKluster That's mostly for keeping the scrip kiddies and bots from banging on it. Once you get in, you should see a drop-down menu near the top left of the screen near "Genomics Institute Grid". From the drop-down menu, select "CG Kubernetes Cluster". It will take you to a page detailing the current resource usage and activity on the nodes. This can be useful for see if anyone else is using the whole cluster, or just to get an idea of how many resources are available for your batch of jobs to assign to the cluster. 460c4ca4a5f1e864bc8948966d8a83eddde41ad0 156 155 2019-09-14T17:03:49Z Weiler 3 /* View the Cluster's Current Activity */ wikitext text/x-wiki __TOC__ The Computational Genomics Group has a Kubernetes Cluster running on several large instances in AWS. The current cluster makeup includes two worker nodes, each with the following specs: * 96 CPU cores (3.1 GHz) * 384 GB RAM * 3.3 TB Local NVMe Flash Storage * 25 Gb/s Network Interface ==Getting Authorized to Connect== If you require access to this kubernetes cluster, contact Benedict Paten asking for permission to use it, then pass on that permission via email to: cluster-admin@soe.ucsc.edu Let us know which group you are with and we can authorize you to use the cluster in the correct namespace. ==Authenticating to Kubernetes== We will authorize (authz) you to use the cluster on the server side, but you will also need to authenticate (authn) using your '@ucsc.edu' email address and a unique Java Web token. 
These credentials are installed in ~/.kube/config in whatever machine you are coming from to get to the cluster. To authenticate and get your base kubernetes configuration, go to this URL (below), which will ask you to authenticate to Google. Use your '@ucsc.edu' email address as the login. It will then ask you to authenticate via CruzID Gold if your web browser doesn't already have the authentication token cached: https://cg-kube-auth.gi.ucsc.edu Once you authenticate (via username/password and 2-factor auth for CruzID Gold), it will pass you back to the 'https://cg-kube-auth.gi.ucsc.edu' website and it should confirm authentication on the top with a message saying "Successfully Authenticated". '''If you see any errors in red,''' but are sure you typed in your password and 2-factor auth correctly, click on the above link again (https://cg-kube-auth.gi.ucsc.edu) and authenticate a second time, which should work. There is a quirk where the web token doesn't always pass back to us correctly on the first try. Upon success, you will be able to click the blue "Download Config File" button, which contains your initial kubernetes config file. Copy this file to your home directory as ~/.kube/config. Follow the directions on the web page to insert your '''"namespace:"''' line as directed. We will let you know which namespace to use. ==Testing Connectivity== Once your ~/.kube/config file is set up correctly, you should be able to connect to the cluster. All our shared servers here at the Genomics Institute have the 'kubectl' command installed on them, but if you are coming from somewhere else, just make sure the "kubectl" utility is installed on that machine. 
A quick test should go as follows: $ kubectl get nodes NAME STATUS ROLES AGE VERSION k1.kube Ready <none> 13h v1.15.3 k2.kube Ready <none> 13h v1.15.3 master.kube Ready master 13h v1.15.3 ==Running Pods and Jobs with Requests and Limits== When running jobs and pods on kubernetes, you will always want to specify "requests" and "limits" on resources, otherwise you pods will get stuck with the default limits which are tiny (to protect against runaway pods). You should always have an idea of how much resources will be consumed by your jobs, and not use much more than that, in order not to hog all the resources. It also prevents your jib from "running away" unexpectedly and chewing up more resources than expected. Here is a good example of a job file that specifies limits: job.yml apiVersion: batch/v1 kind: Job metadata: name: $USER-$TS spec: backoffLimit: 0 ttlSecondsAfterFinished: 30 template: spec: containers: - name: magic image: robcurrie/ubuntu imagePullPolicy: Always resources: requests: cpu: "1" memory: "2G" ephemeral-storage: "2G" limits: cpu: "2" memory: "3G" ephemeral-storage: "3G" command: ["/bin/bash", "-c"] args: ['for i in {1..100}; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done'] restartPolicy: Never '''NOTE:''' Jobs and pods that have '''completed''' over 48 hours ago but have not been cleaned up will be automatically removed by the garbage collector. Most jobs will have the "ttlSecondsAfterFinished" configuration item in them, so they will automatically cleaned up after that time expires, but leaving the old pods and jobs around pins the disk space they were using while they remain, so it's good to get rid of them as soon as they are done unless you are debugging a failure or something like that. Jobs that '''run''' over 48 hours will not be deleted, only the ones that have '''exited''' over 48 hours ago. 
A lot of other good information can be viewed on Rob Currie's github page, which includes examples and some "How To" documentation: https://github.com/rcurrie/kubernetes ==View the Cluster's Current Activity== One quick way to check the cluster's utilization is to do: kubectl top nodes NAME CPU(cores) CPU% MEMORY(bytes) MEMORY% k1.kube 60m 0% 655Mi 0% k2.kube 51m 0% 625Mi 0% master.kube 97m 4% 1913Mi 93% That means the worker nodes, k1 and k2, are using 0% memory and 0% CPU and are basically fully open for jobs. Ignore the master node as that one only handles cluster management and doesn't run jobs or pods for users. You can also take a look at current resource consumption by taking a look at our Ganglia Cluster monitor tool: https://ganglia.gi.ucsc.edu/ That website requires a username and password: username: genecats password: KiloKluster That's mostly for keeping the scrip kiddies and bots from banging on it. Once you get in, you should see a drop-down menu near the top left of the screen near "Genomics Institute Grid". From the drop-down menu, select "CG Kubernetes Cluster". It will take you to a page detailing the current resource usage and activity on the nodes. This can be useful for see if anyone else is using the whole cluster, or just to get an idea of how many resources are available for your batch of jobs to assign to the cluster. 2ee88d0c17a3cfdb32e92791304b12a59ca322d3 158 156 2019-09-23T17:20:20Z Weiler 3 /* Testing Connectivity */ wikitext text/x-wiki __TOC__ The Computational Genomics Group has a Kubernetes Cluster running on several large instances in AWS. 
The current cluster makeup includes two worker nodes, each with the following specs: * 96 CPU cores (3.1 GHz) * 384 GB RAM * 3.3 TB Local NVMe Flash Storage * 25 Gb/s Network Interface ==Getting Authorized to Connect== If you require access to this kubernetes cluster, contact Benedict Paten asking for permission to use it, then pass on that permission via email to: cluster-admin@soe.ucsc.edu Let us know which group you are with and we can authorize you to use the cluster in the correct namespace. ==Authenticating to Kubernetes== We will authorize (authz) you to use the cluster on the server side, but you will also need to authenticate (authn) using your '@ucsc.edu' email address and a unique Java Web token. These credentials are installed in ~/.kube/config in whatever machine you are coming from to get to the cluster. To authenticate and get your base kubernetes configuration, go to this URL (below), which will ask you to authenticate to Google. Use your '@ucsc.edu' email address as the login. It will then ask you to authenticate via CruzID Gold if your web browser doesn't already have the authentication token cached: https://cg-kube-auth.gi.ucsc.edu Once you authenticate (via username/password and 2-factor auth for CruzID Gold), it will pass you back to the 'https://cg-kube-auth.gi.ucsc.edu' website and it should confirm authentication on the top with a message saying "Successfully Authenticated". '''If you see any errors in red,''' but are sure you typed in your password and 2-factor auth correctly, click on the above link again (https://cg-kube-auth.gi.ucsc.edu) and authenticate a second time, which should work. There is a quirk where the web token doesn't always pass back to us correctly on the first try. Upon success, you will be able to click the blue "Download Config File" button, which contains your initial kubernetes config file. Copy this file to your home directory as ~/.kube/config. 
Follow the directions on the web page to insert your '''"namespace:"''' line as directed. We will let you know which namespace to use. ==Testing Connectivity== Once your ~/.kube/config file is set up correctly, you should be able to connect to the cluster. All our shared servers here at the Genomics Institute have the 'kubectl' command installed on them, but if you are coming from somewhere else, just make sure the "kubectl" utility is installed on that machine. A quick test should go as follows: $ kubectl get nodes NAME STATUS ROLES AGE VERSION k1.kube Ready <none> 13h v1.15.3 k2.kube Ready <none> 13h v1.15.3 k3.kube Ready <none> 13h v1.15.3 master.kube Ready master 13h v1.15.3 ==Running Pods and Jobs with Requests and Limits== When running jobs and pods on kubernetes, you will always want to specify "requests" and "limits" on resources, otherwise you pods will get stuck with the default limits which are tiny (to protect against runaway pods). You should always have an idea of how much resources will be consumed by your jobs, and not use much more than that, in order not to hog all the resources. It also prevents your jib from "running away" unexpectedly and chewing up more resources than expected. Here is a good example of a job file that specifies limits: job.yml apiVersion: batch/v1 kind: Job metadata: name: $USER-$TS spec: backoffLimit: 0 ttlSecondsAfterFinished: 30 template: spec: containers: - name: magic image: robcurrie/ubuntu imagePullPolicy: Always resources: requests: cpu: "1" memory: "2G" ephemeral-storage: "2G" limits: cpu: "2" memory: "3G" ephemeral-storage: "3G" command: ["/bin/bash", "-c"] args: ['for i in {1..100}; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done'] restartPolicy: Never '''NOTE:''' Jobs and pods that have '''completed''' over 48 hours ago but have not been cleaned up will be automatically removed by the garbage collector. 
Most jobs will have the "ttlSecondsAfterFinished" configuration item in them, so they will automatically cleaned up after that time expires, but leaving the old pods and jobs around pins the disk space they were using while they remain, so it's good to get rid of them as soon as they are done unless you are debugging a failure or something like that. Jobs that '''run''' over 48 hours will not be deleted, only the ones that have '''exited''' over 48 hours ago. A lot of other good information can be viewed on Rob Currie's github page, which includes examples and some "How To" documentation: https://github.com/rcurrie/kubernetes ==View the Cluster's Current Activity== One quick way to check the cluster's utilization is to do: kubectl top nodes NAME CPU(cores) CPU% MEMORY(bytes) MEMORY% k1.kube 60m 0% 655Mi 0% k2.kube 51m 0% 625Mi 0% master.kube 97m 4% 1913Mi 93% That means the worker nodes, k1 and k2, are using 0% memory and 0% CPU and are basically fully open for jobs. Ignore the master node as that one only handles cluster management and doesn't run jobs or pods for users. You can also take a look at current resource consumption by taking a look at our Ganglia Cluster monitor tool: https://ganglia.gi.ucsc.edu/ That website requires a username and password: username: genecats password: KiloKluster That's mostly for keeping the scrip kiddies and bots from banging on it. Once you get in, you should see a drop-down menu near the top left of the screen near "Genomics Institute Grid". From the drop-down menu, select "CG Kubernetes Cluster". It will take you to a page detailing the current resource usage and activity on the nodes. This can be useful for see if anyone else is using the whole cluster, or just to get an idea of how many resources are available for your batch of jobs to assign to the cluster. 
0699d99ce4a8db645e9f417c02a858a46b078567 159 158 2019-09-23T17:23:02Z Weiler 3 /* View the Cluster's Current Activity */ wikitext text/x-wiki __TOC__ The Computational Genomics Group has a Kubernetes Cluster running on several large instances in AWS. The current cluster makeup includes two worker nodes, each with the following specs: * 96 CPU cores (3.1 GHz) * 384 GB RAM * 3.3 TB Local NVMe Flash Storage * 25 Gb/s Network Interface ==Getting Authorized to Connect== If you require access to this kubernetes cluster, contact Benedict Paten asking for permission to use it, then pass on that permission via email to: cluster-admin@soe.ucsc.edu Let us know which group you are with and we can authorize you to use the cluster in the correct namespace. ==Authenticating to Kubernetes== We will authorize (authz) you to use the cluster on the server side, but you will also need to authenticate (authn) using your '@ucsc.edu' email address and a unique Java Web token. These credentials are installed in ~/.kube/config in whatever machine you are coming from to get to the cluster. To authenticate and get your base kubernetes configuration, go to this URL (below), which will ask you to authenticate to Google. Use your '@ucsc.edu' email address as the login. It will then ask you to authenticate via CruzID Gold if your web browser doesn't already have the authentication token cached: https://cg-kube-auth.gi.ucsc.edu Once you authenticate (via username/password and 2-factor auth for CruzID Gold), it will pass you back to the 'https://cg-kube-auth.gi.ucsc.edu' website and it should confirm authentication on the top with a message saying "Successfully Authenticated". '''If you see any errors in red,''' but are sure you typed in your password and 2-factor auth correctly, click on the above link again (https://cg-kube-auth.gi.ucsc.edu) and authenticate a second time, which should work. There is a quirk where the web token doesn't always pass back to us correctly on the first try. 
Upon success, you will be able to click the blue "Download Config File" button, which contains your initial kubernetes config file. Copy this file to your home directory as ~/.kube/config. Follow the directions on the web page to insert your '''"namespace:"''' line as directed. We will let you know which namespace to use. ==Testing Connectivity== Once your ~/.kube/config file is set up correctly, you should be able to connect to the cluster. All our shared servers here at the Genomics Institute have the 'kubectl' command installed on them, but if you are coming from somewhere else, just make sure the "kubectl" utility is installed on that machine. A quick test should go as follows: $ kubectl get nodes NAME STATUS ROLES AGE VERSION k1.kube Ready <none> 13h v1.15.3 k2.kube Ready <none> 13h v1.15.3 k3.kube Ready <none> 13h v1.15.3 master.kube Ready master 13h v1.15.3 ==Running Pods and Jobs with Requests and Limits== When running jobs and pods on kubernetes, you will always want to specify "requests" and "limits" on resources, otherwise you pods will get stuck with the default limits which are tiny (to protect against runaway pods). You should always have an idea of how much resources will be consumed by your jobs, and not use much more than that, in order not to hog all the resources. It also prevents your jib from "running away" unexpectedly and chewing up more resources than expected. 
Here is a good example of a job file that specifies limits: job.yml apiVersion: batch/v1 kind: Job metadata: name: $USER-$TS spec: backoffLimit: 0 ttlSecondsAfterFinished: 30 template: spec: containers: - name: magic image: robcurrie/ubuntu imagePullPolicy: Always resources: requests: cpu: "1" memory: "2G" ephemeral-storage: "2G" limits: cpu: "2" memory: "3G" ephemeral-storage: "3G" command: ["/bin/bash", "-c"] args: ['for i in {1..100}; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done'] restartPolicy: Never '''NOTE:''' Jobs and pods that have '''completed''' over 48 hours ago but have not been cleaned up will be automatically removed by the garbage collector. Most jobs will have the "ttlSecondsAfterFinished" configuration item in them, so they will automatically cleaned up after that time expires, but leaving the old pods and jobs around pins the disk space they were using while they remain, so it's good to get rid of them as soon as they are done unless you are debugging a failure or something like that. Jobs that '''run''' over 48 hours will not be deleted, only the ones that have '''exited''' over 48 hours ago. A lot of other good information can be viewed on Rob Currie's github page, which includes examples and some "How To" documentation: https://github.com/rcurrie/kubernetes ==View the Cluster's Current Activity== One quick way to check the cluster's utilization is to do: kubectl top nodes NAME CPU(cores) CPU% MEMORY(bytes) MEMORY% k1.kube 1815m 1% 1191Mi 0% k2.kube 51837m 53% 46507Mi 12% k3.kube 1458m 1% 61270Mi 15% master.kube 111m 5% 1024Mi 46% That means the worker nodes, k1, k2 and k3, are using minimal memory, k2 is using 52% CPU but lots of room still open for new jobs. Ignore the master node as that one only handles cluster management and doesn't run jobs or pods for users. 
You can also take a look at current resource consumption by taking a look at our Ganglia Cluster monitor tool: https://ganglia.gi.ucsc.edu/ That website requires a username and password: username: genecats password: KiloKluster That's mostly for keeping the scrip kiddies and bots from banging on it. Once you get in, you should see a drop-down menu near the top left of the screen near "Genomics Institute Grid". From the drop-down menu, select "CG Kubernetes Cluster". It will take you to a page detailing the current resource usage and activity on the nodes. This can be useful for see if anyone else is using the whole cluster, or just to get an idea of how many resources are available for your batch of jobs to assign to the cluster. 92abbb78312957a95257d43db65685f2f46580b1 160 159 2019-09-23T18:11:17Z Weiler 3 wikitext text/x-wiki __TOC__ The Computational Genomics Group has a Kubernetes Cluster running on several large instances in AWS. The current cluster makeup includes three worker nodes, each with the following specs: * 96 CPU cores (3.1 GHz) * 384 GB RAM * 3.3 TB Local NVMe Flash Storage * 25 Gb/s Network Interface ==Getting Authorized to Connect== If you require access to this kubernetes cluster, contact Benedict Paten asking for permission to use it, then pass on that permission via email to: cluster-admin@soe.ucsc.edu Let us know which group you are with and we can authorize you to use the cluster in the correct namespace. ==Authenticating to Kubernetes== We will authorize (authz) you to use the cluster on the server side, but you will also need to authenticate (authn) using your '@ucsc.edu' email address and a unique Java Web token. These credentials are installed in ~/.kube/config in whatever machine you are coming from to get to the cluster. To authenticate and get your base kubernetes configuration, go to this URL (below), which will ask you to authenticate to Google. Use your '@ucsc.edu' email address as the login. 
It will then ask you to authenticate via CruzID Gold if your web browser doesn't already have the authentication token cached: https://cg-kube-auth.gi.ucsc.edu Once you authenticate (via username/password and 2-factor auth for CruzID Gold), it will pass you back to the 'https://cg-kube-auth.gi.ucsc.edu' website and it should confirm authentication on the top with a message saying "Successfully Authenticated". '''If you see any errors in red,''' but are sure you typed in your password and 2-factor auth correctly, click on the above link again (https://cg-kube-auth.gi.ucsc.edu) and authenticate a second time, which should work. There is a quirk where the web token doesn't always pass back to us correctly on the first try. Upon success, you will be able to click the blue "Download Config File" button, which contains your initial kubernetes config file. Copy this file to your home directory as ~/.kube/config. Follow the directions on the web page to insert your '''"namespace:"''' line as directed. We will let you know which namespace to use. ==Testing Connectivity== Once your ~/.kube/config file is set up correctly, you should be able to connect to the cluster. All our shared servers here at the Genomics Institute have the 'kubectl' command installed on them, but if you are coming from somewhere else, just make sure the "kubectl" utility is installed on that machine. A quick test should go as follows: $ kubectl get nodes NAME STATUS ROLES AGE VERSION k1.kube Ready <none> 13h v1.15.3 k2.kube Ready <none> 13h v1.15.3 k3.kube Ready <none> 13h v1.15.3 master.kube Ready master 13h v1.15.3 ==Running Pods and Jobs with Requests and Limits== When running jobs and pods on kubernetes, you will always want to specify "requests" and "limits" on resources, otherwise you pods will get stuck with the default limits which are tiny (to protect against runaway pods). 
You should always have an idea of how much resources will be consumed by your jobs, and not use much more than that, in order not to hog all the resources. It also prevents your jib from "running away" unexpectedly and chewing up more resources than expected. Here is a good example of a job file that specifies limits: job.yml apiVersion: batch/v1 kind: Job metadata: name: $USER-$TS spec: backoffLimit: 0 ttlSecondsAfterFinished: 30 template: spec: containers: - name: magic image: robcurrie/ubuntu imagePullPolicy: Always resources: requests: cpu: "1" memory: "2G" ephemeral-storage: "2G" limits: cpu: "2" memory: "3G" ephemeral-storage: "3G" command: ["/bin/bash", "-c"] args: ['for i in {1..100}; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done'] restartPolicy: Never '''NOTE:''' Jobs and pods that have '''completed''' over 48 hours ago but have not been cleaned up will be automatically removed by the garbage collector. Most jobs will have the "ttlSecondsAfterFinished" configuration item in them, so they will automatically cleaned up after that time expires, but leaving the old pods and jobs around pins the disk space they were using while they remain, so it's good to get rid of them as soon as they are done unless you are debugging a failure or something like that. Jobs that '''run''' over 48 hours will not be deleted, only the ones that have '''exited''' over 48 hours ago. A lot of other good information can be viewed on Rob Currie's github page, which includes examples and some "How To" documentation: https://github.com/rcurrie/kubernetes ==View the Cluster's Current Activity== One quick way to check the cluster's utilization is to do: kubectl top nodes NAME CPU(cores) CPU% MEMORY(bytes) MEMORY% k1.kube 1815m 1% 1191Mi 0% k2.kube 51837m 53% 46507Mi 12% k3.kube 1458m 1% 61270Mi 15% master.kube 111m 5% 1024Mi 46% That means the worker nodes, k1, k2 and k3, are using minimal memory, k2 is using 52% CPU but lots of room still open for new jobs. 
Ignore the master node as that one only handles cluster management and doesn't run jobs or pods for users. You can also take a look at current resource consumption by taking a look at our Ganglia Cluster monitor tool: https://ganglia.gi.ucsc.edu/ That website requires a username and password: username: genecats password: KiloKluster That's mostly for keeping the scrip kiddies and bots from banging on it. Once you get in, you should see a drop-down menu near the top left of the screen near "Genomics Institute Grid". From the drop-down menu, select "CG Kubernetes Cluster". It will take you to a page detailing the current resource usage and activity on the nodes. This can be useful for see if anyone else is using the whole cluster, or just to get an idea of how many resources are available for your batch of jobs to assign to the cluster. 851b38228b7e561dc5fb6491865e06f7d5c1205b 161 160 2019-09-23T19:40:55Z Weiler 3 /* Running Pods and Jobs with Requests and Limits */ wikitext text/x-wiki __TOC__ The Computational Genomics Group has a Kubernetes Cluster running on several large instances in AWS. The current cluster makeup includes three worker nodes, each with the following specs: * 96 CPU cores (3.1 GHz) * 384 GB RAM * 3.3 TB Local NVMe Flash Storage * 25 Gb/s Network Interface ==Getting Authorized to Connect== If you require access to this kubernetes cluster, contact Benedict Paten asking for permission to use it, then pass on that permission via email to: cluster-admin@soe.ucsc.edu Let us know which group you are with and we can authorize you to use the cluster in the correct namespace. ==Authenticating to Kubernetes== We will authorize (authz) you to use the cluster on the server side, but you will also need to authenticate (authn) using your '@ucsc.edu' email address and a unique Java Web token. These credentials are installed in ~/.kube/config in whatever machine you are coming from to get to the cluster. 
To authenticate and get your base kubernetes configuration, go to this URL (below), which will ask you to authenticate to Google. Use your '@ucsc.edu' email address as the login. It will then ask you to authenticate via CruzID Gold if your web browser doesn't already have the authentication token cached: https://cg-kube-auth.gi.ucsc.edu Once you authenticate (via username/password and 2-factor auth for CruzID Gold), it will pass you back to the 'https://cg-kube-auth.gi.ucsc.edu' website and it should confirm authentication on the top with a message saying "Successfully Authenticated". '''If you see any errors in red,''' but are sure you typed in your password and 2-factor auth correctly, click on the above link again (https://cg-kube-auth.gi.ucsc.edu) and authenticate a second time, which should work. There is a quirk where the web token doesn't always pass back to us correctly on the first try. Upon success, you will be able to click the blue "Download Config File" button, which contains your initial kubernetes config file. Copy this file to your home directory as ~/.kube/config. Follow the directions on the web page to insert your '''"namespace:"''' line as directed. We will let you know which namespace to use. ==Testing Connectivity== Once your ~/.kube/config file is set up correctly, you should be able to connect to the cluster. All our shared servers here at the Genomics Institute have the 'kubectl' command installed on them, but if you are coming from somewhere else, just make sure the "kubectl" utility is installed on that machine. 
A quick test should go as follows: $ kubectl get nodes NAME STATUS ROLES AGE VERSION k1.kube Ready <none> 13h v1.15.3 k2.kube Ready <none> 13h v1.15.3 k3.kube Ready <none> 13h v1.15.3 master.kube Ready master 13h v1.15.3 ==Running Pods and Jobs with Requests and Limits== When running jobs and pods on kubernetes, you will always want to specify "requests" and "limits" on resources, otherwise your pods will get stuck with the default limits which are tiny (to protect against runaway pods). You should always have an idea of how much resources will be consumed by your jobs, and not use much more than that, in order not to hog all the resources. It also prevents your jib from "running away" unexpectedly and chewing up more resources than expected. Here is a good example of a job file that specifies limits: job.yml apiVersion: batch/v1 kind: Job metadata: name: $USER-$TS spec: backoffLimit: 0 ttlSecondsAfterFinished: 30 template: spec: containers: - name: magic image: robcurrie/ubuntu imagePullPolicy: Always resources: requests: cpu: "1" memory: "2G" ephemeral-storage: "2G" limits: cpu: "2" memory: "3G" ephemeral-storage: "3G" command: ["/bin/bash", "-c"] args: ['for i in {1..100}; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done'] restartPolicy: Never '''NOTE:''' Jobs and pods that have '''completed''' over 48 hours ago but have not been cleaned up will be automatically removed by the garbage collector. Most jobs will have the "ttlSecondsAfterFinished" configuration item in them, so they will automatically cleaned up after that time expires, but leaving the old pods and jobs around pins the disk space they were using while they remain, so it's good to get rid of them as soon as they are done unless you are debugging a failure or something like that. Jobs that '''run''' over 48 hours will not be deleted, only the ones that have '''exited''' over 48 hours ago. 
A lot of other good information can be viewed on Rob Currie's github page, which includes examples and some "How To" documentation: https://github.com/rcurrie/kubernetes ==View the Cluster's Current Activity== One quick way to check the cluster's utilization is to do: kubectl top nodes NAME CPU(cores) CPU% MEMORY(bytes) MEMORY% k1.kube 1815m 1% 1191Mi 0% k2.kube 51837m 53% 46507Mi 12% k3.kube 1458m 1% 61270Mi 15% master.kube 111m 5% 1024Mi 46% That means the worker nodes, k1, k2 and k3, are using minimal memory, k2 is using 52% CPU but lots of room still open for new jobs. Ignore the master node as that one only handles cluster management and doesn't run jobs or pods for users. You can also take a look at current resource consumption by taking a look at our Ganglia Cluster monitor tool: https://ganglia.gi.ucsc.edu/ That website requires a username and password: username: genecats password: KiloKluster That's mostly for keeping the scrip kiddies and bots from banging on it. Once you get in, you should see a drop-down menu near the top left of the screen near "Genomics Institute Grid". From the drop-down menu, select "CG Kubernetes Cluster". It will take you to a page detailing the current resource usage and activity on the nodes. This can be useful for see if anyone else is using the whole cluster, or just to get an idea of how many resources are available for your batch of jobs to assign to the cluster. a4c12c781a54552d4f1eb24421c83d0b201b9ada 163 161 2019-12-02T23:32:26Z Weiler 3 /* Running Pods and Jobs with Requests and Limits */ wikitext text/x-wiki __TOC__ The Computational Genomics Group has a Kubernetes Cluster running on several large instances in AWS. 
The current cluster makeup includes three worker nodes, each with the following specs:

* 96 CPU cores (3.1 GHz)
* 384 GB RAM
* 3.3 TB local NVMe flash storage
* 25 Gb/s network interface

==Getting Authorized to Connect==

If you require access to this Kubernetes cluster, first contact Benedict Paten to ask for permission to use it, then forward that permission via email to: cluster-admin@soe.ucsc.edu

Let us know which group you are with, and we can authorize you to use the cluster in the correct namespace.

==Authenticating to Kubernetes==

We will authorize (authz) you to use the cluster on the server side, but you will also need to authenticate (authn) using your '@ucsc.edu' email address and a unique JSON Web Token. These credentials are installed in ~/.kube/config on whatever machine you are connecting from.

To authenticate and get your base Kubernetes configuration, go to the URL below, which will ask you to authenticate to Google. Use your '@ucsc.edu' email address as the login. It will then ask you to authenticate via CruzID Gold if your web browser doesn't already have the authentication token cached:

https://cg-kube-auth.gi.ucsc.edu

Once you authenticate (via username/password and two-factor auth for CruzID Gold), you will be passed back to the https://cg-kube-auth.gi.ucsc.edu website, which should confirm authentication at the top with the message "Successfully Authenticated".

'''If you see any errors in red''' but are sure you typed your password and two-factor code correctly, click the link above (https://cg-kube-auth.gi.ucsc.edu) and authenticate a second time, which should work. There is a quirk where the web token doesn't always pass back to us correctly on the first try.

Upon success, you will be able to click the blue "Download Config File" button to download your initial Kubernetes config file. Copy this file to your home directory as ~/.kube/config.
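For reference, the relevant part of the config file looks roughly like the fragment below. Every name in it is a placeholder (keep whatever the downloaded file actually contains); the only line you add by hand is the "namespace:" one, per the instructions on the download page:

 # Rough shape of the contexts section of ~/.kube/config. Names are
 # placeholders -- keep whatever the downloaded file contains. Only the
 # "namespace:" line is the one you insert yourself.
 contexts:
 - name: default
   context:
     cluster: cg-kube
     user: yourname@ucsc.edu
     namespace: your-namespace    # <- insert this line, per the web page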
Follow the directions on the web page to insert your '''"namespace:"''' line as directed. We will let you know which namespace to use. ==Testing Connectivity== Once your ~/.kube/config file is set up correctly, you should be able to connect to the cluster. All our shared servers here at the Genomics Institute have the 'kubectl' command installed on them, but if you are coming from somewhere else, just make sure the "kubectl" utility is installed on that machine. A quick test should go as follows: $ kubectl get nodes NAME STATUS ROLES AGE VERSION k1.kube Ready <none> 13h v1.15.3 k2.kube Ready <none> 13h v1.15.3 k3.kube Ready <none> 13h v1.15.3 master.kube Ready master 13h v1.15.3 ==Running Pods and Jobs with Requests and Limits== When running jobs and pods on kubernetes, you will always want to specify "requests" and "limits" on resources, otherwise your pods will get stuck with the default limits which are tiny (to protect against runaway pods). You should always have an idea of how much resources will be consumed by your jobs, and not use much more than that, in order not to hog all the resources. It also prevents your jib from "running away" unexpectedly and chewing up more resources than expected. Here is a good example of a job file that specifies limits: job.yml apiVersion: batch/v1 kind: Job metadata: name: $USER-$TS spec: backoffLimit: 0 ttlSecondsAfterFinished: 30 template: spec: containers: - name: magic image: robcurrie/ubuntu imagePullPolicy: Always resources: requests: cpu: "1" memory: "2G" ephemeral-storage: "2G" limits: cpu: "1" memory: "2G" ephemeral-storage: "3G" command: ["/bin/bash", "-c"] args: ['for i in {1..100}; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done'] restartPolicy: Never Please note that the "request" and "limit" item fields should be the same. You would think that you could set the limit higher than the request, but in reality they need to match in order for the pod to stay within the kubernetes resource limit bubble. 
If you set the limit higher than the request, then you are risking the pod using more memory than the scheduler expects, and the node can start killing off random other colocated pods by way of OOM, which is very, very bad for the cluster. '''NOTE:''' Jobs and pods that have '''completed''' over 72 hours ago but have not been cleaned up will be automatically removed by the garbage collector. Most jobs will have the "ttlSecondsAfterFinished" configuration item in them, so they will automatically cleaned up after that time expires, but leaving the old pods and jobs around pins the disk space they were using while they remain, so it's good to get rid of them as soon as they are done unless you are debugging a failure or something like that. Jobs that '''run''' over 72 hours will not be deleted, only the ones that have '''exited''' over 72 hours ago. A lot of other good information can be viewed on Rob Currie's github page, which includes examples and some "How To" documentation: https://github.com/rcurrie/kubernetes ==View the Cluster's Current Activity== One quick way to check the cluster's utilization is to do: kubectl top nodes NAME CPU(cores) CPU% MEMORY(bytes) MEMORY% k1.kube 1815m 1% 1191Mi 0% k2.kube 51837m 53% 46507Mi 12% k3.kube 1458m 1% 61270Mi 15% master.kube 111m 5% 1024Mi 46% That means the worker nodes, k1, k2 and k3, are using minimal memory, k2 is using 52% CPU but lots of room still open for new jobs. Ignore the master node as that one only handles cluster management and doesn't run jobs or pods for users. You can also take a look at current resource consumption by taking a look at our Ganglia Cluster monitor tool: https://ganglia.gi.ucsc.edu/ That website requires a username and password: username: genecats password: KiloKluster That's mostly for keeping the scrip kiddies and bots from banging on it. Once you get in, you should see a drop-down menu near the top left of the screen near "Genomics Institute Grid". 
From the drop-down menu, select "CG Kubernetes Cluster". It will take you to a page detailing the current resource usage and activity on the nodes. This can be useful for see if anyone else is using the whole cluster, or just to get an idea of how many resources are available for your batch of jobs to assign to the cluster. a5ab0e05854b1075c5142967647081eb38c235cf 164 163 2019-12-06T21:12:33Z Weiler 3 /* View the Cluster's Current Activity */ wikitext text/x-wiki __TOC__ The Computational Genomics Group has a Kubernetes Cluster running on several large instances in AWS. The current cluster makeup includes three worker nodes, each with the following specs: * 96 CPU cores (3.1 GHz) * 384 GB RAM * 3.3 TB Local NVMe Flash Storage * 25 Gb/s Network Interface ==Getting Authorized to Connect== If you require access to this kubernetes cluster, contact Benedict Paten asking for permission to use it, then pass on that permission via email to: cluster-admin@soe.ucsc.edu Let us know which group you are with and we can authorize you to use the cluster in the correct namespace. ==Authenticating to Kubernetes== We will authorize (authz) you to use the cluster on the server side, but you will also need to authenticate (authn) using your '@ucsc.edu' email address and a unique Java Web token. These credentials are installed in ~/.kube/config in whatever machine you are coming from to get to the cluster. To authenticate and get your base kubernetes configuration, go to this URL (below), which will ask you to authenticate to Google. Use your '@ucsc.edu' email address as the login. 
It will then ask you to authenticate via CruzID Gold if your web browser doesn't already have the authentication token cached: https://cg-kube-auth.gi.ucsc.edu Once you authenticate (via username/password and 2-factor auth for CruzID Gold), it will pass you back to the 'https://cg-kube-auth.gi.ucsc.edu' website and it should confirm authentication on the top with a message saying "Successfully Authenticated". '''If you see any errors in red,''' but are sure you typed in your password and 2-factor auth correctly, click on the above link again (https://cg-kube-auth.gi.ucsc.edu) and authenticate a second time, which should work. There is a quirk where the web token doesn't always pass back to us correctly on the first try. Upon success, you will be able to click the blue "Download Config File" button, which contains your initial kubernetes config file. Copy this file to your home directory as ~/.kube/config. Follow the directions on the web page to insert your '''"namespace:"''' line as directed. We will let you know which namespace to use. ==Testing Connectivity== Once your ~/.kube/config file is set up correctly, you should be able to connect to the cluster. All our shared servers here at the Genomics Institute have the 'kubectl' command installed on them, but if you are coming from somewhere else, just make sure the "kubectl" utility is installed on that machine. A quick test should go as follows: $ kubectl get nodes NAME STATUS ROLES AGE VERSION k1.kube Ready <none> 13h v1.15.3 k2.kube Ready <none> 13h v1.15.3 k3.kube Ready <none> 13h v1.15.3 master.kube Ready master 13h v1.15.3 ==Running Pods and Jobs with Requests and Limits== When running jobs and pods on kubernetes, you will always want to specify "requests" and "limits" on resources, otherwise your pods will get stuck with the default limits which are tiny (to protect against runaway pods). 
You should always have an idea of how much resources will be consumed by your jobs, and not use much more than that, in order not to hog all the resources. It also prevents your jib from "running away" unexpectedly and chewing up more resources than expected. Here is a good example of a job file that specifies limits: job.yml apiVersion: batch/v1 kind: Job metadata: name: $USER-$TS spec: backoffLimit: 0 ttlSecondsAfterFinished: 30 template: spec: containers: - name: magic image: robcurrie/ubuntu imagePullPolicy: Always resources: requests: cpu: "1" memory: "2G" ephemeral-storage: "2G" limits: cpu: "1" memory: "2G" ephemeral-storage: "3G" command: ["/bin/bash", "-c"] args: ['for i in {1..100}; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done'] restartPolicy: Never Please note that the "request" and "limit" item fields should be the same. You would think that you could set the limit higher than the request, but in reality they need to match in order for the pod to stay within the kubernetes resource limit bubble. If you set the limit higher than the request, then you are risking the pod using more memory than the scheduler expects, and the node can start killing off random other colocated pods by way of OOM, which is very, very bad for the cluster. '''NOTE:''' Jobs and pods that have '''completed''' over 72 hours ago but have not been cleaned up will be automatically removed by the garbage collector. Most jobs will have the "ttlSecondsAfterFinished" configuration item in them, so they will automatically cleaned up after that time expires, but leaving the old pods and jobs around pins the disk space they were using while they remain, so it's good to get rid of them as soon as they are done unless you are debugging a failure or something like that. Jobs that '''run''' over 72 hours will not be deleted, only the ones that have '''exited''' over 72 hours ago. 
A lot of other good information can be viewed on Rob Currie's github page, which includes examples and some "How To" documentation: https://github.com/rcurrie/kubernetes ==View the Cluster's Current Activity== One quick way to check the cluster's utilization is to do: kubectl top nodes NAME CPU(cores) CPU% MEMORY(bytes) MEMORY% k1.kube 1815m 1% 1191Mi 0% k2.kube 51837m 53% 46507Mi 12% k3.kube 1458m 1% 61270Mi 15% master.kube 111m 5% 1024Mi 46% That means the worker nodes, k1, k2 and k3, are using minimal memory, k2 is using 52% CPU but lots of room still open for new jobs. Ignore the master node as that one only handles cluster management and doesn't run jobs or pods for users. Another good way to get a lot of details about the current state of the cluster is through the Kubernetes Dashboard: https://cgl-k8s-dashboard.gi.ucsc.edu/ Select the "token" login method, and paste in this(long) token: eyJhbGciOiJSUzI1NiIsImtpZCI6IiJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJrdWJlcm5ldGVzLWRhc2hib2FyZC10b2tlbi0ycDY4cCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6ImE5ZGI2Y2I0LWMyYWYtNDM3My04ZmM2LWE4YWYwYTBmNGRkNCIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDprdWJlcm5ldGVzLWRhc2hib2FyZDprdWJlcm5ldGVzLWRhc2hib2FyZCJ9.ZVQjryG_ksfvReIfq4Frb6M4sE6OVDOXnFy9Aii-h3mrpdHRE6bgjdAvSGZ0jJSIUEz5GgPBQ0lCwhyZocivHHr4zTrNxMkOFZhPDnpvF6RVIDWTkqmH9Dg6qmro0gTJP75oKBpt7dFN2pW4zvqOAzqPmh7qxfoVusN8X6U13YirMFEf65-aGL-_FFNBsEzvjkC-BgXWbtk3YZc8CJL7xtvlKLyE6u6jC9Qx0SWnwzkALlxmzo_yYTDKpIrWiQGEqzLQOxKml-H0kSYLDX-t4sTivXp4vCw_ruoqwIpLnnQAC7q3ZtSTxHIrxbB7n_M8gfhpXtwprbPav-XmBk1xaQ The dashboard is read-only, so you won't be able to edit anything, it's mostly for seeing what's going on and where. 
You can also take a look at current resource consumption by taking a look at our Ganglia Cluster monitor tool: https://ganglia.gi.ucsc.edu/ That website requires a username and password: username: genecats password: KiloKluster That's mostly for keeping the scrip kiddies and bots from banging on it. Once you get in, you should see a drop-down menu near the top left of the screen near "Genomics Institute Grid". From the drop-down menu, select "CG Kubernetes Cluster". It will take you to a page detailing the current resource usage and activity on the nodes. This can be useful for see if anyone else is using the whole cluster, or just to get an idea of how many resources are available for your batch of jobs to assign to the cluster. 172607f3053ab16dcd66cf09d8cce4fb08a89e3a 165 164 2019-12-06T21:12:58Z Weiler 3 /* View the Cluster's Current Activity */ wikitext text/x-wiki __TOC__ The Computational Genomics Group has a Kubernetes Cluster running on several large instances in AWS. The current cluster makeup includes three worker nodes, each with the following specs: * 96 CPU cores (3.1 GHz) * 384 GB RAM * 3.3 TB Local NVMe Flash Storage * 25 Gb/s Network Interface ==Getting Authorized to Connect== If you require access to this kubernetes cluster, contact Benedict Paten asking for permission to use it, then pass on that permission via email to: cluster-admin@soe.ucsc.edu Let us know which group you are with and we can authorize you to use the cluster in the correct namespace. ==Authenticating to Kubernetes== We will authorize (authz) you to use the cluster on the server side, but you will also need to authenticate (authn) using your '@ucsc.edu' email address and a unique Java Web token. These credentials are installed in ~/.kube/config in whatever machine you are coming from to get to the cluster. To authenticate and get your base kubernetes configuration, go to this URL (below), which will ask you to authenticate to Google. 
Use your '@ucsc.edu' email address as the login. It will then ask you to authenticate via CruzID Gold if your web browser doesn't already have the authentication token cached: https://cg-kube-auth.gi.ucsc.edu Once you authenticate (via username/password and 2-factor auth for CruzID Gold), it will pass you back to the 'https://cg-kube-auth.gi.ucsc.edu' website and it should confirm authentication on the top with a message saying "Successfully Authenticated". '''If you see any errors in red,''' but are sure you typed in your password and 2-factor auth correctly, click on the above link again (https://cg-kube-auth.gi.ucsc.edu) and authenticate a second time, which should work. There is a quirk where the web token doesn't always pass back to us correctly on the first try. Upon success, you will be able to click the blue "Download Config File" button, which contains your initial kubernetes config file. Copy this file to your home directory as ~/.kube/config. Follow the directions on the web page to insert your '''"namespace:"''' line as directed. We will let you know which namespace to use. ==Testing Connectivity== Once your ~/.kube/config file is set up correctly, you should be able to connect to the cluster. All our shared servers here at the Genomics Institute have the 'kubectl' command installed on them, but if you are coming from somewhere else, just make sure the "kubectl" utility is installed on that machine. A quick test should go as follows: $ kubectl get nodes NAME STATUS ROLES AGE VERSION k1.kube Ready <none> 13h v1.15.3 k2.kube Ready <none> 13h v1.15.3 k3.kube Ready <none> 13h v1.15.3 master.kube Ready master 13h v1.15.3 ==Running Pods and Jobs with Requests and Limits== When running jobs and pods on kubernetes, you will always want to specify "requests" and "limits" on resources, otherwise your pods will get stuck with the default limits which are tiny (to protect against runaway pods). 
You should always have an idea of how much resources will be consumed by your jobs, and not use much more than that, in order not to hog all the resources. It also prevents your jib from "running away" unexpectedly and chewing up more resources than expected. Here is a good example of a job file that specifies limits: job.yml apiVersion: batch/v1 kind: Job metadata: name: $USER-$TS spec: backoffLimit: 0 ttlSecondsAfterFinished: 30 template: spec: containers: - name: magic image: robcurrie/ubuntu imagePullPolicy: Always resources: requests: cpu: "1" memory: "2G" ephemeral-storage: "2G" limits: cpu: "1" memory: "2G" ephemeral-storage: "3G" command: ["/bin/bash", "-c"] args: ['for i in {1..100}; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done'] restartPolicy: Never Please note that the "request" and "limit" item fields should be the same. You would think that you could set the limit higher than the request, but in reality they need to match in order for the pod to stay within the kubernetes resource limit bubble. If you set the limit higher than the request, then you are risking the pod using more memory than the scheduler expects, and the node can start killing off random other colocated pods by way of OOM, which is very, very bad for the cluster. '''NOTE:''' Jobs and pods that have '''completed''' over 72 hours ago but have not been cleaned up will be automatically removed by the garbage collector. Most jobs will have the "ttlSecondsAfterFinished" configuration item in them, so they will automatically cleaned up after that time expires, but leaving the old pods and jobs around pins the disk space they were using while they remain, so it's good to get rid of them as soon as they are done unless you are debugging a failure or something like that. Jobs that '''run''' over 72 hours will not be deleted, only the ones that have '''exited''' over 72 hours ago. 
A lot of other good information can be viewed on Rob Currie's github page, which includes examples and some "How To" documentation: https://github.com/rcurrie/kubernetes ==View the Cluster's Current Activity== One quick way to check the cluster's utilization is to do: kubectl top nodes NAME CPU(cores) CPU% MEMORY(bytes) MEMORY% k1.kube 1815m 1% 1191Mi 0% k2.kube 51837m 53% 46507Mi 12% k3.kube 1458m 1% 61270Mi 15% master.kube 111m 5% 1024Mi 46% That means the worker nodes, k1, k2 and k3, are using minimal memory, k2 is using 52% CPU but lots of room still open for new jobs. Ignore the master node as that one only handles cluster management and doesn't run jobs or pods for users. Another good way to get a lot of details about the current state of the cluster is through the Kubernetes Dashboard: https://cgl-k8s-dashboard.gi.ucsc.edu/ Select the "token" login method, and paste in this (long) token: eyJhbGciOiJSUzI1NiIsImtpZCI6IiJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJrdWJlcm5ldGVzLWRhc2hib2FyZC10b2tlbi0ycDY4cCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6ImE5ZGI2Y2I0LWMyYWYtNDM3My04ZmM2LWE4YWYwYTBmNGRkNCIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDprdWJlcm5ldGVzLWRhc2hib2FyZDprdWJlcm5ldGVzLWRhc2hib2FyZCJ9.ZVQjryG_ksfvReIfq4Frb6M4sE6OVDOXnFy9Aii-h3mrpdHRE6bgjdAvSGZ0jJSIUEz5GgPBQ0lCwhyZocivHHr4zTrNxMkOFZhPDnpvF6RVIDWTkqmH9Dg6qmro0gTJP75oKBpt7dFN2pW4zvqOAzqPmh7qxfoVusN8X6U13YirMFEf65-aGL-_FFNBsEzvjkC-BgXWbtk3YZc8CJL7xtvlKLyE6u6jC9Qx0SWnwzkALlxmzo_yYTDKpIrWiQGEqzLQOxKml-H0kSYLDX-t4sTivXp4vCw_ruoqwIpLnnQAC7q3ZtSTxHIrxbB7n_M8gfhpXtwprbPav-XmBk1xaQ The dashboard is read-only, so you won't be able to edit anything, it's mostly for seeing what's going on and where. 
You can also take a look at current resource consumption by taking a look at our Ganglia Cluster monitor tool: https://ganglia.gi.ucsc.edu/ That website requires a username and password: username: genecats password: KiloKluster That's mostly for keeping the scrip kiddies and bots from banging on it. Once you get in, you should see a drop-down menu near the top left of the screen near "Genomics Institute Grid". From the drop-down menu, select "CG Kubernetes Cluster". It will take you to a page detailing the current resource usage and activity on the nodes. This can be useful for see if anyone else is using the whole cluster, or just to get an idea of how many resources are available for your batch of jobs to assign to the cluster. 7ae2d0b83584b006e70ab81949dd28ca08cc8d44 177 165 2020-02-05T22:34:40Z Weiler 3 /* Running Pods and Jobs with Requests and Limits */ wikitext text/x-wiki __TOC__ The Computational Genomics Group has a Kubernetes Cluster running on several large instances in AWS. The current cluster makeup includes three worker nodes, each with the following specs: * 96 CPU cores (3.1 GHz) * 384 GB RAM * 3.3 TB Local NVMe Flash Storage * 25 Gb/s Network Interface ==Getting Authorized to Connect== If you require access to this kubernetes cluster, contact Benedict Paten asking for permission to use it, then pass on that permission via email to: cluster-admin@soe.ucsc.edu Let us know which group you are with and we can authorize you to use the cluster in the correct namespace. ==Authenticating to Kubernetes== We will authorize (authz) you to use the cluster on the server side, but you will also need to authenticate (authn) using your '@ucsc.edu' email address and a unique Java Web token. These credentials are installed in ~/.kube/config in whatever machine you are coming from to get to the cluster. To authenticate and get your base kubernetes configuration, go to this URL (below), which will ask you to authenticate to Google. 
Use your '@ucsc.edu' email address as the login. It will then ask you to authenticate via CruzID Gold if your web browser doesn't already have the authentication token cached: https://cg-kube-auth.gi.ucsc.edu Once you authenticate (via username/password and 2-factor auth for CruzID Gold), it will pass you back to the 'https://cg-kube-auth.gi.ucsc.edu' website and it should confirm authentication on the top with a message saying "Successfully Authenticated". '''If you see any errors in red,''' but are sure you typed in your password and 2-factor auth correctly, click on the above link again (https://cg-kube-auth.gi.ucsc.edu) and authenticate a second time, which should work. There is a quirk where the web token doesn't always pass back to us correctly on the first try. Upon success, you will be able to click the blue "Download Config File" button, which contains your initial kubernetes config file. Copy this file to your home directory as ~/.kube/config. Follow the directions on the web page to insert your '''"namespace:"''' line as directed. We will let you know which namespace to use. ==Testing Connectivity== Once your ~/.kube/config file is set up correctly, you should be able to connect to the cluster. All our shared servers here at the Genomics Institute have the 'kubectl' command installed on them, but if you are coming from somewhere else, just make sure the "kubectl" utility is installed on that machine. A quick test should go as follows: $ kubectl get nodes NAME STATUS ROLES AGE VERSION k1.kube Ready <none> 13h v1.15.3 k2.kube Ready <none> 13h v1.15.3 k3.kube Ready <none> 13h v1.15.3 master.kube Ready master 13h v1.15.3 ==Running Pods and Jobs with Requests and Limits== When running jobs and pods on kubernetes, you will always want to specify "requests" and "limits" on resources, otherwise your pods will get stuck with the default limits which are tiny (to protect against runaway pods). 
You should always have an idea of how much resources will be consumed by your jobs, and not use much more than that, in order not to hog all the resources. It also prevents your job from "running away" unexpectedly and chewing up more resources than expected. Here is a good example of a job file that specifies limits: job.yml apiVersion: batch/v1 kind: Job metadata: name: $USER-$TS spec: backoffLimit: 0 ttlSecondsAfterFinished: 30 template: spec: containers: - name: magic image: robcurrie/ubuntu imagePullPolicy: Always resources: requests: cpu: "1" memory: "2G" ephemeral-storage: "2G" limits: cpu: "1" memory: "2G" ephemeral-storage: "3G" command: ["/bin/bash", "-c"] args: ['for i in {1..100}; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done'] restartPolicy: Never priorityClassName: medium-priority Please note that the "request" and "limit" item fields should be the same. You would think that you could set the limit higher than the request, but in reality they need to match in order for the pod to stay within the kubernetes resource limit bubble. If you set the limit higher than the request, then you are risking the pod using more memory than the scheduler expects, and the node can start killing off random other colocated pods by way of OOM, which is very, very bad for the cluster. Also note the "priorityClassName" line. Available values are: high-priority medium-priority low-priority That affects how quickly your jobs move up the queue in the event there are a lot of queued jobs. Always use "medium-priority" as the default unless you specifically know you need it higher or lower. Higher priority jobs will always go in front of lower priority jobs. '''NOTE:''' Jobs and pods that have '''completed''' over 72 hours ago but have not been cleaned up will be automatically removed by the garbage collector. 
Most jobs will have the "ttlSecondsAfterFinished" configuration item in them, so they will automatically cleaned up after that time expires, but leaving the old pods and jobs around pins the disk space they were using while they remain, so it's good to get rid of them as soon as they are done unless you are debugging a failure or something like that. Jobs that '''run''' over 72 hours will not be deleted, only the ones that have '''exited''' over 72 hours ago. A lot of other good information can be viewed on Rob Currie's github page, which includes examples and some "How To" documentation: https://github.com/rcurrie/kubernetes ==View the Cluster's Current Activity== One quick way to check the cluster's utilization is to do: kubectl top nodes NAME CPU(cores) CPU% MEMORY(bytes) MEMORY% k1.kube 1815m 1% 1191Mi 0% k2.kube 51837m 53% 46507Mi 12% k3.kube 1458m 1% 61270Mi 15% master.kube 111m 5% 1024Mi 46% That means the worker nodes, k1, k2 and k3, are using minimal memory, k2 is using 52% CPU but lots of room still open for new jobs. Ignore the master node as that one only handles cluster management and doesn't run jobs or pods for users. 
Another good way to get a lot of details about the current state of the cluster is through the Kubernetes Dashboard: https://cgl-k8s-dashboard.gi.ucsc.edu/ Select the "token" login method, and paste in this (long) token: eyJhbGciOiJSUzI1NiIsImtpZCI6IiJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJrdWJlcm5ldGVzLWRhc2hib2FyZC10b2tlbi0ycDY4cCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6ImE5ZGI2Y2I0LWMyYWYtNDM3My04ZmM2LWE4YWYwYTBmNGRkNCIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDprdWJlcm5ldGVzLWRhc2hib2FyZDprdWJlcm5ldGVzLWRhc2hib2FyZCJ9.ZVQjryG_ksfvReIfq4Frb6M4sE6OVDOXnFy9Aii-h3mrpdHRE6bgjdAvSGZ0jJSIUEz5GgPBQ0lCwhyZocivHHr4zTrNxMkOFZhPDnpvF6RVIDWTkqmH9Dg6qmro0gTJP75oKBpt7dFN2pW4zvqOAzqPmh7qxfoVusN8X6U13YirMFEf65-aGL-_FFNBsEzvjkC-BgXWbtk3YZc8CJL7xtvlKLyE6u6jC9Qx0SWnwzkALlxmzo_yYTDKpIrWiQGEqzLQOxKml-H0kSYLDX-t4sTivXp4vCw_ruoqwIpLnnQAC7q3ZtSTxHIrxbB7n_M8gfhpXtwprbPav-XmBk1xaQ The dashboard is read-only, so you won't be able to edit anything, it's mostly for seeing what's going on and where. You can also take a look at current resource consumption by taking a look at our Ganglia Cluster monitor tool: https://ganglia.gi.ucsc.edu/ That website requires a username and password: username: genecats password: KiloKluster That's mostly for keeping the scrip kiddies and bots from banging on it. Once you get in, you should see a drop-down menu near the top left of the screen near "Genomics Institute Grid". From the drop-down menu, select "CG Kubernetes Cluster". It will take you to a page detailing the current resource usage and activity on the nodes. 
__TOC__

The Computational Genomics Group has a Kubernetes cluster running on several large instances in AWS. The current cluster includes three worker nodes, each with the following specs:

* 96 CPU cores (3.1 GHz)
* 384 GB RAM
* 3.3 TB local NVMe flash storage
* 25 Gb/s network interface

==Getting Authorized to Connect==

If you need access to this Kubernetes cluster, contact Benedict Paten asking for permission to use it, then forward that permission via email to: cluster-admin@soe.ucsc.edu

Let us know which group you are with and we can authorize you to use the cluster in the correct namespace.

==Authenticating to Kubernetes==

We will authorize (authz) you to use the cluster on the server side, but you will also need to authenticate (authn) using your '@ucsc.edu' email address and a unique JSON Web Token. These credentials are installed in ~/.kube/config on whatever machine you connect from.

To authenticate and get your base Kubernetes configuration, go to the URL below, which will ask you to authenticate to Google. Use your '@ucsc.edu' email address as the login. It will then ask you to authenticate via CruzID Gold if your web browser doesn't already have the authentication token cached:

https://cg-kube-auth.gi.ucsc.edu

Once you authenticate (via username/password and 2-factor auth for CruzID Gold), it will pass you back to the 'https://cg-kube-auth.gi.ucsc.edu' website, which should confirm authentication at the top with the message "Successfully Authenticated".
'''If you see any errors in red,''' but are sure you typed your password and 2-factor auth correctly, click the link again (https://cg-kube-auth.gi.ucsc.edu) and authenticate a second time, which should work. There is a quirk where the web token doesn't always pass back to us correctly on the first try.

Upon success, you will be able to click the blue "Download Config File" button, which downloads your initial Kubernetes config file. Copy this file to your home directory as ~/.kube/config. Follow the directions on the web page to insert your '''"namespace:"''' line as directed. We will let you know which namespace to use.

==Testing Connectivity==

Once your ~/.kube/config file is set up correctly, you should be able to connect to the cluster. All our shared servers here at the Genomics Institute have the 'kubectl' command installed, but if you are coming from somewhere else, make sure the "kubectl" utility is installed on that machine. A quick test should go as follows:

 $ kubectl get nodes
 NAME          STATUS   ROLES    AGE   VERSION
 k1.kube       Ready    <none>   13h   v1.15.3
 k2.kube       Ready    <none>   13h   v1.15.3
 k3.kube       Ready    <none>   13h   v1.15.3
 master.kube   Ready    master   13h   v1.15.3

==Running Pods and Jobs with Requests and Limits==

When running jobs and pods on Kubernetes, you will always want to specify "requests" and "limits" on resources; otherwise your pods will be stuck with the default limits, which are tiny (to protect against runaway pods). You should always have an idea of how many resources your jobs will consume, and not request much more than that, in order not to hog all the resources. It also prevents your job from "running away" unexpectedly and chewing up more resources than expected.
Here is a good example of a job file that specifies limits:

job.yml:

 apiVersion: batch/v1
 kind: Job
 metadata:
   name: $USER-$TS
 spec:
   backoffLimit: 0
   ttlSecondsAfterFinished: 30
   template:
     spec:
       containers:
       - name: magic
         image: robcurrie/ubuntu
         imagePullPolicy: Always
         resources:
           requests:
             cpu: "1"
             memory: "2G"
             ephemeral-storage: "2G"
           limits:
             cpu: "1"
             memory: "2G"
             ephemeral-storage: "3G"
         command: ["/bin/bash", "-c"]
         args: ['for i in {1..100}; do echo "$i: $(date)"; sleep 1; done']
       restartPolicy: Never
       priorityClassName: medium-priority

Please note that the "requests" and "limits" fields should be the same. You would think you could set the limit higher than the request, but in reality they need to match in order for the pod to stay within the Kubernetes resource limit bubble. If you set the limit higher than the request, you risk the pod using more memory than the scheduler expects, and the node can start killing off random other colocated pods by way of OOM, which is very, very bad for the cluster. If you omit the "requests" section altogether, the limit values will be used, so if you use only one, use "limits".

Also note the "priorityClassName" line. Available values are:

 high-priority
 medium-priority
 low-priority

That affects how quickly your jobs move up the queue when there are a lot of queued jobs. Always use "medium-priority" as the default unless you specifically know you need it higher or lower. Higher-priority jobs will always go in front of lower-priority jobs.

'''NOTE:''' Jobs and pods that '''completed''' over 72 hours ago but have not been cleaned up will be automatically removed by the garbage collector.
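The manifest above leaves $USER and $TS as placeholders for a unique job name. Here is a minimal sketch of filling them in and submitting the job; the filenames and the sed-based substitution are just one way to do it (assuming the manifest is saved as job.yml):

```shell
# Fill in the $USER and $TS placeholders, then submit the job.
# Assumes the manifest above is saved as job.yml.
TS=$(date +%Y%m%d-%H%M%S)                            # unique, sortable suffix
sed -e "s/\$USER/$USER/g" -e "s/\$TS/$TS/g" job.yml > job-filled.yml
kubectl apply -f job-filled.yml                      # runs in your configured namespace
kubectl get jobs                                     # confirm the job was created
```

Once a job has finished, running "kubectl delete job" on it by name frees its resources immediately instead of waiting out the garbage collector described below.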
Most jobs will have the "ttlSecondsAfterFinished" configuration item in them, so they will automatically cleaned up after that time expires, but leaving the old pods and jobs around pins the disk space they were using while they remain, so it's good to get rid of them as soon as they are done unless you are debugging a failure or something like that. Jobs that '''run''' over 72 hours will not be deleted, only the ones that have '''exited''' over 72 hours ago. A lot of other good information can be viewed on Rob Currie's github page, which includes examples and some "How To" documentation: https://github.com/rcurrie/kubernetes ==View the Cluster's Current Activity== One quick way to check the cluster's utilization is to do: kubectl top nodes NAME CPU(cores) CPU% MEMORY(bytes) MEMORY% k1.kube 1815m 1% 1191Mi 0% k2.kube 51837m 53% 46507Mi 12% k3.kube 1458m 1% 61270Mi 15% master.kube 111m 5% 1024Mi 46% That means the worker nodes, k1, k2 and k3, are using minimal memory, k2 is using 52% CPU but lots of room still open for new jobs. Ignore the master node as that one only handles cluster management and doesn't run jobs or pods for users. 
Another good way to get a lot of details about the current state of the cluster is through the Kubernetes Dashboard: https://cgl-k8s-dashboard.gi.ucsc.edu/ Select the "token" login method, and paste in this (long) token: eyJhbGciOiJSUzI1NiIsImtpZCI6IiJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJrdWJlcm5ldGVzLWRhc2hib2FyZC10b2tlbi0ycDY4cCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6ImE5ZGI2Y2I0LWMyYWYtNDM3My04ZmM2LWE4YWYwYTBmNGRkNCIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDprdWJlcm5ldGVzLWRhc2hib2FyZDprdWJlcm5ldGVzLWRhc2hib2FyZCJ9.ZVQjryG_ksfvReIfq4Frb6M4sE6OVDOXnFy9Aii-h3mrpdHRE6bgjdAvSGZ0jJSIUEz5GgPBQ0lCwhyZocivHHr4zTrNxMkOFZhPDnpvF6RVIDWTkqmH9Dg6qmro0gTJP75oKBpt7dFN2pW4zvqOAzqPmh7qxfoVusN8X6U13YirMFEf65-aGL-_FFNBsEzvjkC-BgXWbtk3YZc8CJL7xtvlKLyE6u6jC9Qx0SWnwzkALlxmzo_yYTDKpIrWiQGEqzLQOxKml-H0kSYLDX-t4sTivXp4vCw_ruoqwIpLnnQAC7q3ZtSTxHIrxbB7n_M8gfhpXtwprbPav-XmBk1xaQ The dashboard is read-only, so you won't be able to edit anything, it's mostly for seeing what's going on and where. You can also take a look at current resource consumption by taking a look at our Ganglia Cluster monitor tool: https://ganglia.gi.ucsc.edu/ That website requires a username and password: username: genecats password: KiloKluster That's mostly for keeping the scrip kiddies and bots from banging on it. Once you get in, you should see a drop-down menu near the top left of the screen near "Genomics Institute Grid". From the drop-down menu, select "CG Kubernetes Cluster". It will take you to a page detailing the current resource usage and activity on the nodes. 
This can be useful for see if anyone else is using the whole cluster, or just to get an idea of how many resources are available for your batch of jobs to assign to the cluster. 06d1d9162c5d6b73723bb5eb2644e5cf0113d333 179 178 2020-03-13T22:27:16Z Anovak 4 wikitext text/x-wiki __TOC__ The Computational Genomics Group has a Kubernetes Cluster running on several large instances in AWS. The current cluster makeup includes three worker nodes, each with the following specs: * 96 CPU cores (3.1 GHz) * 384 GB RAM * 3.3 TB Local NVMe Flash Storage * 25 Gb/s Network Interface ==Getting Authorized to Connect== If you require access to this kubernetes cluster, contact Benedict Paten asking for permission to use it, then pass on that permission via email to: cluster-admin@soe.ucsc.edu Let us know which group you are with and we can authorize you to use the cluster in the correct namespace. ==Authenticating to Kubernetes== We will authorize (authz) you to use the cluster on the server side, but you will also need to authenticate (authn) using your '@ucsc.edu' email address and a unique Java Web token. These credentials are installed in ~/.kube/config in whatever machine you are coming from to get to the cluster. To authenticate and get your base kubernetes configuration, go to this URL (below), which will ask you to authenticate to Google. Use your '@ucsc.edu' email address as the login. It will then ask you to authenticate via CruzID Gold if your web browser doesn't already have the authentication token cached: https://cg-kube-auth.gi.ucsc.edu Once you authenticate (via username/password and 2-factor auth for CruzID Gold), it will pass you back to the 'https://cg-kube-auth.gi.ucsc.edu' website and it should confirm authentication on the top with a message saying "Successfully Authenticated". 
'''If you see any errors in red,''' but are sure you typed in your password and 2-factor auth correctly, click on the above link again (https://cg-kube-auth.gi.ucsc.edu) and authenticate a second time, which should work. There is a quirk where the web token doesn't always pass back to us correctly on the first try. Upon success, you will be able to click the blue "Download Config File" button, which contains your initial kubernetes config file. Copy this file to your home directory as ~/.kube/config. Follow the directions on the web page to insert your '''"namespace:"''' line as directed. We will let you know which namespace to use. ==Testing Connectivity== Once your ~/.kube/config file is set up correctly, you should be able to connect to the cluster. All our shared servers here at the Genomics Institute have the 'kubectl' command installed on them, but if you are coming from somewhere else, just make sure the "kubectl" utility is installed on that machine. A quick test should go as follows: $ kubectl get nodes NAME STATUS ROLES AGE VERSION k1.kube Ready <none> 13h v1.15.3 k2.kube Ready <none> 13h v1.15.3 k3.kube Ready <none> 13h v1.15.3 master.kube Ready master 13h v1.15.3 ==Running Pods and Jobs with Requests and Limits== When running jobs and pods on kubernetes, you will always want to specify "requests" and "limits" on resources, otherwise your pods will get stuck with the default limits which are tiny (to protect against runaway pods). You should always have an idea of how much resources will be consumed by your jobs, and not use much more than that, in order not to hog all the resources. It also prevents your job from "running away" unexpectedly and chewing up more resources than expected. 
Here is a good example of a job file that specifies limits: job.yml apiVersion: batch/v1 kind: Job metadata: name: $USER-$TS spec: backoffLimit: 0 ttlSecondsAfterFinished: 30 template: spec: containers: - name: magic image: robcurrie/ubuntu imagePullPolicy: Always resources: requests: cpu: "1" memory: "2G" ephemeral-storage: "2G" limits: cpu: "1" memory: "2G" ephemeral-storage: "3G" command: ["/bin/bash", "-c"] args: ['for i in {1..100}; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done'] restartPolicy: Never priorityClassName: medium-priority Please note that the "request" and "limit" item fields should be the same. You would think that you could set the limit higher than the request, but in reality they need to match in order for the pod to stay within the kubernetes resource limit bubble. If you set the limit higher than the request, then you are risking the pod using more memory than the scheduler expects, and the node can start killing off random other colocated pods by way of OOM, which is very, very bad for the cluster. If you omit the "requests" section altogether, the limit values will be used, so if you use only one, use "limits". Also note the "priorityClassName" line. Available values are: high-priority medium-priority low-priority That affects how quickly your jobs move up the queue in the event there are a lot of queued jobs. Always use "medium-priority" as the default unless you specifically know you need it higher or lower. Higher priority jobs will always go in front of lower priority jobs. '''NOTE:''' Jobs and pods that have '''completed''' over 72 hours ago but have not been cleaned up will be automatically removed by the garbage collector. 
Most jobs will have the "ttlSecondsAfterFinished" configuration item in them, so they will automatically cleaned up after that time expires, but leaving the old pods and jobs around pins the disk space they were using while they remain, so it's good to get rid of them as soon as they are done unless you are debugging a failure or something like that. Jobs that '''run''' over 72 hours will not be deleted, only the ones that have '''exited''' over 72 hours ago. A lot of other good information can be viewed on Rob Currie's github page, which includes examples and some "How To" documentation: https://github.com/rcurrie/kubernetes == Inlining Jobs in Shell and Shell in Jobs == When interactively developing on Kubernetes, it can be useful to be able to have a shell command you can copy and paste to run a Kubernetes job, rather than having to create YAML files on disk. Similarly, it can be useful to have shell scripting inline in your Kubernetes job definitions, rather than having to bake your experimental script into a Docker container. Here's an example that does both, putting the YAML inside a heredoc and putting the script to run in the container inside a multiline YAML string. We precede this with a command to delete the job, so you can modify your script and re-paste it to replace a failed or failing job. We also make sure to mount the AWS credentials in the container, so that the ''aws'' command will be able to access S3 if you install it. 
kubectl delete job username-job kubectl apply -f - <<'EOF' apiVersion: batch/v1 kind: Job metadata: name: username-job spec: ttlSecondsAfterFinished: 1000 template: spec: containers: - name: main imagePullPolicy: Always image: ubuntu:18.04 command: - /bin/bash - -c - | set -e DEBIAN_FRONTEND=noninteractive apt-get update DEBIAN_FRONTEND=noninteractive apt-get install -y awscli cowsay cowsay "Listing files" aws s3 ls s3://vg-k8s/ volumeMounts: - mountPath: /tmp name: scratch-volume - mountPath: /root/.aws name: s3-credentials resources: limits: cpu: 1 memory: "4Gi" ephemeral-storage: "10Gi" restartPolicy: Never volumes: - name: scratch-volume emptyDir: {} - name: s3-credentials secret: secretName: shared-s3-credentials backoffLimit: 0 EOF Make sure to replace "username-job" with a unique job name that includes ''your'' username. ==View the Cluster's Current Activity== One quick way to check the cluster's utilization is to do: kubectl top nodes NAME CPU(cores) CPU% MEMORY(bytes) MEMORY% k1.kube 1815m 1% 1191Mi 0% k2.kube 51837m 53% 46507Mi 12% k3.kube 1458m 1% 61270Mi 15% master.kube 111m 5% 1024Mi 46% That means the worker nodes, k1, k2 and k3, are using minimal memory, k2 is using 52% CPU but lots of room still open for new jobs. Ignore the master node as that one only handles cluster management and doesn't run jobs or pods for users. 
Another good way to get a lot of details about the current state of the cluster is through the Kubernetes Dashboard: https://cgl-k8s-dashboard.gi.ucsc.edu/ Select the "token" login method, and paste in this (long) token: eyJhbGciOiJSUzI1NiIsImtpZCI6IiJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJrdWJlcm5ldGVzLWRhc2hib2FyZC10b2tlbi0ycDY4cCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6ImE5ZGI2Y2I0LWMyYWYtNDM3My04ZmM2LWE4YWYwYTBmNGRkNCIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDprdWJlcm5ldGVzLWRhc2hib2FyZDprdWJlcm5ldGVzLWRhc2hib2FyZCJ9.ZVQjryG_ksfvReIfq4Frb6M4sE6OVDOXnFy9Aii-h3mrpdHRE6bgjdAvSGZ0jJSIUEz5GgPBQ0lCwhyZocivHHr4zTrNxMkOFZhPDnpvF6RVIDWTkqmH9Dg6qmro0gTJP75oKBpt7dFN2pW4zvqOAzqPmh7qxfoVusN8X6U13YirMFEf65-aGL-_FFNBsEzvjkC-BgXWbtk3YZc8CJL7xtvlKLyE6u6jC9Qx0SWnwzkALlxmzo_yYTDKpIrWiQGEqzLQOxKml-H0kSYLDX-t4sTivXp4vCw_ruoqwIpLnnQAC7q3ZtSTxHIrxbB7n_M8gfhpXtwprbPav-XmBk1xaQ The dashboard is read-only, so you won't be able to edit anything, it's mostly for seeing what's going on and where. You can also take a look at current resource consumption by taking a look at our Ganglia Cluster monitor tool: https://ganglia.gi.ucsc.edu/ That website requires a username and password: username: genecats password: KiloKluster That's mostly for keeping the scrip kiddies and bots from banging on it. Once you get in, you should see a drop-down menu near the top left of the screen near "Genomics Institute Grid". From the drop-down menu, select "CG Kubernetes Cluster". It will take you to a page detailing the current resource usage and activity on the nodes. 
__TOC__

The Computational Genomics Group has a Kubernetes cluster running on several large instances in AWS. The cluster currently has three worker nodes, each with the following specs:
* 96 CPU cores (3.1 GHz)
* 384 GB RAM
* 3.3 TB Local NVMe Flash Storage
* 25 Gb/s Network Interface

==Getting Authorized to Connect==
If you require access to this Kubernetes cluster, contact Benedict Paten asking for permission to use it, then forward that permission via email to: cluster-admin@soe.ucsc.edu

Let us know which group you are with so we can authorize you to use the cluster in the correct namespace.

==Authenticating to Kubernetes==
We will authorize (authz) you to use the cluster on the server side, but you will also need to authenticate (authn) using your '@ucsc.edu' email address and a unique JSON Web Token. These credentials are installed in ~/.kube/config on whatever machine you use to reach the cluster.

To authenticate and get your base Kubernetes configuration, go to the URL below, which will ask you to authenticate to Google. Use your '@ucsc.edu' email address as the login. It will then ask you to authenticate via CruzID Gold if your web browser doesn't already have the authentication token cached:

https://cg-kube-auth.gi.ucsc.edu

Once you authenticate (via username/password and 2-factor auth for CruzID Gold), it will pass you back to the https://cg-kube-auth.gi.ucsc.edu website, which should confirm authentication at the top with the message "Successfully Authenticated".
'''If you see any errors in red,''' but are sure you typed your password and 2-factor auth correctly, click the link again (https://cg-kube-auth.gi.ucsc.edu) and authenticate a second time, which should work; there is a quirk where the web token doesn't always pass back to us correctly on the first try. Upon success, you will be able to click the blue "Download Config File" button to get your initial Kubernetes config file. Copy this file to your home directory as ~/.kube/config, and follow the directions on the web page to insert your '''"namespace:"''' line. We will let you know which namespace to use.

==Testing Connectivity==
Once your ~/.kube/config file is set up correctly, you should be able to connect to the cluster. All of the shared servers here at the Genomics Institute have the 'kubectl' command installed; if you are coming from somewhere else, make sure the "kubectl" utility is installed on that machine. A quick test should go as follows:

 $ kubectl get nodes
 NAME          STATUS   ROLES    AGE   VERSION
 k1.kube       Ready    <none>   13h   v1.15.3
 k2.kube       Ready    <none>   13h   v1.15.3
 k3.kube       Ready    <none>   13h   v1.15.3
 master.kube   Ready    master   13h   v1.15.3

==Running Pods and Jobs with Requests and Limits==
When running jobs and pods on Kubernetes, you will always want to specify resource "requests" and "limits"; otherwise your pods will be stuck with the default limits, which are tiny (to protect against runaway pods). Know roughly how many resources your jobs will consume, and don't request much more than that, so you don't hog the cluster. Limits also keep a job from unexpectedly "running away" and chewing up more resources than planned.
Here is a good example of a job file that specifies limits:

job.yml:
 apiVersion: batch/v1
 kind: Job
 metadata:
   name: $USER-$TS
 spec:
   backoffLimit: 0
   ttlSecondsAfterFinished: 30
   template:
     spec:
       containers:
       - name: magic
         image: robcurrie/ubuntu
         imagePullPolicy: Always
         resources:
           requests:
             cpu: "1"
             memory: "2G"
             ephemeral-storage: "2G"
           limits:
             cpu: "1"
             memory: "2G"
             ephemeral-storage: "3G"
         command: ["/bin/bash", "-c"]
         args: ['for i in {1..100}; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done']
       restartPolicy: Never
       priorityClassName: medium-priority

Please note that the "requests" and "limits" fields should be the same. You might think you could set the limit higher than the request, but in reality they need to match for the pod to stay within the Kubernetes resource limit bubble. If you set the limit higher than the request, you risk the pod using more memory than the scheduler expects, and the node can start OOM-killing random other colocated pods, which is very, very bad for the cluster. If you omit the "requests" section altogether, the limit values will be used for it - so if you use only one, use "limits".

Also note the "priorityClassName" line. Available values are:
 high-priority
 medium-priority
 low-priority

This affects how quickly your jobs move up the queue when there are a lot of queued jobs. Always use "medium-priority" as the default unless you specifically know you need it higher or lower. Higher-priority jobs will always go in front of lower-priority jobs.

'''NOTE:''' Jobs and pods that '''completed''' over 72 hours ago but have not been cleaned up will be automatically removed by the garbage collector.
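One thing to watch in the resource values above: Kubernetes distinguishes decimal suffixes ("G", powers of 1000) from binary ones ("Gi", powers of 1024), so "2G" is slightly less memory than "2Gi". A quick sanity check of the difference, using nothing but plain shell arithmetic:

```shell
# Kubernetes memory suffixes: "G" is decimal (10^9 bytes), "Gi" is binary (2^30 bytes).
echo "2G  = $((2 * 1000 * 1000 * 1000)) bytes"   # 2000000000
echo "2Gi = $((2 * 1024 * 1024 * 1024)) bytes"   # 2147483648
```

Either suffix works in a manifest; just be consistent so your requests mean what you think they mean.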
Most jobs will have the "ttlSecondsAfterFinished" configuration item in them, so they will automatically be cleaned up after that time expires. Leaving old pods and jobs around pins the disk space they were using for as long as they remain, so it's good to get rid of them as soon as they are done, unless you are debugging a failure or something like that. Jobs that '''run''' over 72 hours will not be deleted; only ones that '''exited''' over 72 hours ago are removed.

A lot of other good information, including examples and some "How To" documentation, can be found on Rob Currie's GitHub page:

https://github.com/rcurrie/kubernetes

==Inlining Jobs in Shell and Shell in Jobs==
When interactively developing on Kubernetes, it can be useful to have a shell command you can copy and paste to run a Kubernetes job, rather than having to create YAML files on disk. Similarly, it can be useful to have shell scripting inline in your Kubernetes job definitions, rather than having to bake your experimental script into a Docker container. Here's an example that does both, putting the YAML inside a heredoc and the script to run in the container inside a multiline YAML string. We precede it with a command to delete the job, so you can modify your script and re-paste it to replace a failed or failing job. We also make sure to mount the AWS credentials in the container, so that the ''aws'' command will be able to access S3 if you install it.
 kubectl delete job username-job
 kubectl apply -f - <<'EOF'
 apiVersion: batch/v1
 kind: Job
 metadata:
   name: username-job
 spec:
   ttlSecondsAfterFinished: 1000
   template:
     spec:
       containers:
       - name: main
         imagePullPolicy: Always
         image: ubuntu:18.04
         command:
         - /bin/bash
         - -c
         - |
           set -e
           DEBIAN_FRONTEND=noninteractive apt-get update
           DEBIAN_FRONTEND=noninteractive apt-get install -y awscli cowsay
           cowsay "Listing files"
           aws s3 ls s3://vg-k8s/
         volumeMounts:
         - mountPath: /tmp
           name: scratch-volume
         - mountPath: /root/.aws
           name: s3-credentials
         resources:
           limits:
             cpu: 1
             memory: "4Gi"
             ephemeral-storage: "10Gi"
       restartPolicy: Never
       volumes:
       - name: scratch-volume
         emptyDir: {}
       - name: s3-credentials
         secret:
           secretName: shared-s3-credentials
   backoffLimit: 0
 EOF

Make sure to replace "username-job" with a unique job name that includes ''your'' username.

==View the Cluster's Current Activity==
One quick way to check the cluster's utilization is:

 kubectl top nodes
 NAME          CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
 k1.kube       1815m        1%     1191Mi          0%
 k2.kube       51837m       53%    46507Mi         12%
 k3.kube       1458m        1%     61270Mi         15%
 master.kube   111m         5%     1024Mi          46%

This shows that the worker nodes k1, k2, and k3 are using minimal memory; k2 is using about 53% of its CPU, but there is still plenty of room for new jobs. Ignore the master node: it only handles cluster management and doesn't run user jobs or pods.
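Note that the CPU(cores) column is reported in millicores ("m"): 1000m is one full core, so k2's 51837m above is roughly 51.8 of its 96 cores. A quick conversion sketch with plain shell arithmetic, using that sample value:

```shell
# Convert kubectl's millicore notation (e.g. "51837m") to whole cores.
cpu="51837m"
echo "$(( ${cpu%m} / 1000 )) cores"   # integer part: 51 cores
```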
Another good way to get details about the current state of the cluster is the Kubernetes Dashboard:

https://cgl-k8s-dashboard.gi.ucsc.edu/

Select the "token" login method, and paste in this (long) token:

 eyJhbGciOiJSUzI1NiIsImtpZCI6IiJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJrdWJlcm5ldGVzLWRhc2hib2FyZC10b2tlbi0ycDY4cCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6ImE5ZGI2Y2I0LWMyYWYtNDM3My04ZmM2LWE4YWYwYTBmNGRkNCIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDprdWJlcm5ldGVzLWRhc2hib2FyZDprdWJlcm5ldGVzLWRhc2hib2FyZCJ9.ZVQjryG_ksfvReIfq4Frb6M4sE6OVDOXnFy9Aii-h3mrpdHRE6bgjdAvSGZ0jJSIUEz5GgPBQ0lCwhyZocivHHr4zTrNxMkOFZhPDnpvF6RVIDWTkqmH9Dg6qmro0gTJP75oKBpt7dFN2pW4zvqOAzqPmh7qxfoVusN8X6U13YirMFEf65-aGL-_FFNBsEzvjkC-BgXWbtk3YZc8CJL7xtvlKLyE6u6jC9Qx0SWnwzkALlxmzo_yYTDKpIrWiQGEqzLQOxKml-H0kSYLDX-t4sTivXp4vCw_ruoqwIpLnnQAC7q3ZtSTxHIrxbB7n_M8gfhpXtwprbPav-XmBk1xaQ

The dashboard is read-only, so you won't be able to edit anything; it's mostly for seeing what's going on and where.

You can also look at current resource consumption with our Ganglia cluster monitoring tool:

https://ganglia.gi.ucsc.edu/

That website requires a username and password:

 username: genecats
 password: KiloKluster

That's mostly for keeping the script kiddies and bots from banging on it. Once you get in, you should see a drop-down menu near the top left of the screen, near "Genomics Institute Grid". From that menu, select "CG Kubernetes Cluster". It will take you to a page detailing the current resource usage and activity on the nodes.
This can be useful for seeing whether anyone else is using the whole cluster, or just to get an idea of how many resources are available for the batch of jobs you want to assign to the cluster.

==Profiling with Perf==
You can use Linux's "perf" to profile your code on the Kubernetes cluster. Here is an example of a job that does so. You need to obtain a "perf" binary that matches the version of the kernel that the Kubernetes ''hosts'' are running, which most likely does not correspond to any version of "perf" available in the Ubuntu repositories. Here we download a binary previously uploaded to S3.

Also, the Kubernetes hosts have '''Non-Uniform Memory Access (NUMA)''': some physical memory is "closer" to some physical cores than to other physical cores. The system is divided into '''NUMA nodes''', each containing some cores and some memory. Memory access from a node to its own memory is significantly faster than access to other nodes' memory. For consistent profiling, it is important to restrict your application to a single NUMA node if possible, with "numactl", so that all accesses are local to that node. If you don't do this, your application's performance will vary arbitrarily depending on whether and when threads are scheduled on the different NUMA nodes of the system.
 apiVersion: batch/v1
 kind: Job
 metadata:
   name: username-profiling
 spec:
   ttlSecondsAfterFinished: 1000
   template:
     metadata:
       # Apply a label saying that we use NUMA node 0
       labels:
         usesnuma0: "Yes"
     spec:
       affinity:
         # Say that we should not schedule on the same node as any other pod with that label
         podAntiAffinity:
           requiredDuringSchedulingIgnoredDuringExecution:
           - labelSelector:
               matchExpressions:
               - key: usesnuma0
                 operator: In
                 values:
                 - "Yes"
             topologyKey: "kubernetes.io/hostname"
       containers:
       - name: main
         imagePullPolicy: Always
         image: ubuntu:18.04
         command:
         - /bin/bash
         - -c
         - |
           set -e
           DEBIAN_FRONTEND=noninteractive apt-get update
           DEBIAN_FRONTEND=noninteractive apt-get install -y awscli numactl
           # Use this particular perf binary that matches the hosts' kernels
           # If it is missing or outdated, get a new one from Erich or cluster-admin
           aws s3 cp s3://vg-k8s/users/adamnovak/projects/test/perf /usr/bin/perf
           chmod +x /usr/bin/perf
           # Do your work with perf here.
           # Use numactl to limit your code to NUMA node 0 for consistent memory access times
         volumeMounts:
         - mountPath: /tmp
           name: scratch-volume
         - mountPath: /root/.aws
           name: s3-credentials
         resources:
           limits:
             cpu: 24  # One NUMA node on our machines is 24 cores.
             memory: "150Gi"
             ephemeral-storage: "400Gi"
       restartPolicy: Never
       volumes:
       - name: scratch-volume
         emptyDir: {}
       - name: s3-credentials
         secret:
           secretName: shared-s3-credentials
   backoffLimit: 0

= Overview of Getting and Using an AWS IAM Account =

__TOC__

== Getting AWS (Amazon Web Services) Access ==
The Genomics Institute has a series of AWS accounts that support different projects. Often, if you become associated with one or more of those projects, you will need access to the corresponding account or accounts.
The way we manage AWS IAM account access is to have one 'top level' AWS account that everyone gets access to; once you log in there, you "Switch Role" into the sub-account you actually run things in. To get access, have your PI or Project Manager email cluster-admin (cluster-admin@soe.ucsc.edu) asking for an AWS account for you, naming in that email the projects you will have access to. The cluster-admin group will contact you with your login credentials.

Once you log in, you can change your password if you want to, and you will be able to set up MFA (Multi-Factor Authentication) for your account. You will be required to use MFA in order to "Switch Role" into any of the sub-accounts for the projects you are working on.

The login URL for the top level account, known as "gi-gateway", is:

https://gi-gateway.signin.aws.amazon.com/console

When you log in, you '''may''' see a couple of error messages on the AWS dashboard saying you don't have access to view certain resources - '''this is normal''', so just ignore them.

== Configuring Account Credentials ==
Once you log in to gi-gateway, you will have very few permissions to do anything there - which is normal, since you will not be working in that account anyway. The gi-gateway account is just there to authenticate you to AWS.

'''Changing Your Password'''

You can change your password by clicking on your username at the top right of the browser window, just to the right of the little bell. If your username is bill@ucsc.edu, for example:
* Click '''"bill@ucsc.edu @ gi-gateway"''' at the top right of your browser window.
* Click the '''"My Security Credentials"''' drop-down menu option.
* Click the '''"Change Password"''' button to change your password.
Note that we have a password strength policy in place, so your password must conform to the following requirements:
* At least 10 characters long
* At least one lowercase letter
* At least one non-alphanumeric character
* At least one number

You will also need to configure MFA on your account before you will be allowed to switch roles into another account.

'''Configuring MFA'''

The most common way to configure MFA (Multi-Factor Authentication) is with '''Google Authenticator''', a free app available for Apple and Android cell phones and mobile devices; simply download it from the app store to get started. Other MFA apps may also work, but we have not tested everything out there. Once you have Google Authenticator installed, log into the gi-gateway account using the above URL, then:
* Click '''"bill@ucsc.edu @ gi-gateway"''' at the top right of your browser window (again, bill@ucsc.edu is an example).
* Click the '''"My Security Credentials"''' drop-down menu option.
* Scroll down to the MFA (Multi-Factor Authentication) section of the page, and click '''"Assign MFA Device"'''.
* In the following menu, select '''"Virtual MFA Device"'''.
* In the following window, click the '''"Show QR Code"''' link, and the MFA QR barcode will appear on your screen.
* Open the Google Authenticator app on your mobile device, and tap the little "+" symbol in the top right corner of the app to add an account.
* Select "Scan Barcode" in the Google Authenticator app, and aim your mobile device's camera at the QR barcode.
* The new MFA account should then be set up, and you should see a 6-digit number with a small timer to the right of it.
Type the 6-digit code the app displays into your web browser when asked, then wait for the next code to appear after the timer expires, and type that into the second field. It should then inform you that you have successfully associated an MFA device with your account.

Once you have associated an MFA device with the gi-gateway account, '''log out''', then log back in. It will ask for your username and password, and then for your MFA code, which you can view by opening Google Authenticator and seeing what code it is displaying at that time. The code changes every 30 seconds or so. '''You must log out first and log back in using MFA in order to be able to switch roles!!!'''

== Switching Roles into Another AWS Account ==
Now that you have configured a password and enabled MFA, you will be allowed to "Switch Roles" into another account so you can begin work there. The first time you switch roles into an account, it will ask you a few questions; subsequently it will remember which roles you have access to, and they will become menu items you can click on to quickly switch roles.

Let's assume that you want to switch to the 'pangenomics' AWS account, and that the cluster-admin group has already granted you access to do so. After logging into the 'gi-gateway' account at the URL listed here (same as above):

https://gi-gateway.signin.aws.amazon.com/console

Do the following to switch roles into the 'pangenomics' account (as an example):
* Click '''"bill@ucsc.edu @ gi-gateway"''' at the top right of your browser window (again, bill@ucsc.edu is an example).
* Click the '''"Switch Role"''' option in the drop-down menu.
* In the following menu, it will ask about the role you will be assuming. In our example we will use:
** Account* = pangenomics
** Role* = developer
** Display Name = [leave blank, or use a short phrase]
** Color = [choose a color for this role]
* Then click the "Switch Role" button.
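As a convenience (not required), the AWS console also accepts pre-filled switch-role links that you can bookmark. Assuming the same example account and role (the displayName value here is just an illustrative label), such a link would look something like:

```
https://signin.aws.amazon.com/switchrole?account=pangenomics&roleName=developer&displayName=pangenomics-developer
```

Opening it takes you straight to the Switch Role form with those fields filled in.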
If all went well, you should be dropped into the 'pangenomics' account, and you should be identified in the top right-hand corner of the page as '''"developer @ pangenomics"''', indicating your role and the account you are active in. You can then work as normal in that account. If you have not yet been given access to that role, you will receive an error message and not be allowed to switch roles.

'''NOTE:''' When you switch roles, it may drop you into a region that you don't expect. Always verify the region you are in by looking at the top right of the web page - it will display your region there. Most of our resources exist in "Oregon" (us-west-2), but some items appear in other regions on a per-case basis.

If you wish to switch context back to the 'gi-gateway' account in order to manage something, or to switch to another role in another account, simply:
* Click '''"developer @ pangenomics"''' in the top right corner of the window.
* Select '''"Back to bill@ucsc.edu"'''.

You will then be sent back to the 'gi-gateway' context, where you can add another role to switch into, manage your credentials, and switch roles again.

== API Access and Secret Keys ==
If you require programmatic access to AWS, you are likely familiar with the AWS concept of access keys and secret keys, which scripts can use to authenticate to AWS and call the APIs without going through the web console. In the past, access keys and secret keys could be used with no further authentication. That introduces a security risk, as those keys must be carefully guarded - if anyone gets your keys, they can rack up charges on your AWS account without your knowledge! With the "Assume Role" mechanism we now use, access keys and secret keys can still be created by users, '''but only while logged into the gi-gateway account'''. Do not try to create keys while you have "Switched Roles" into another account.
Keys you create in the top level 'gi-gateway' account will work for you in any sub-account you have access to switch roles to. You will need to do a little more configuration for your keys to work from a UNIX command line however. When using the '''"awscli"''' command line tool, assuming you have it installed (the process of which is outside the scope of this document), you would use the steps outlined in this document to configure it: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html That document has a lot of other really useful information in it - if you plan on using keys for API access, we highly recommend reading it through. It should be noted that we recommend awscli version 1.16.187 or later, as earlier versions have documented issues with using profiles and MFA related actions. You can determine your version of awscli by doing: aws --version Generically, if you plan on using keys for API Access, minimally you will need to configure the "aws" utility and then tweak the config a bit for our setup. To start, run "aws configure". It should look something like this: $ aws configure AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY Default region name [None]: us-west-2 Default output format [None]: Most folks do that to start. It creates two files: ~/.aws/config ~/.aws/credentials Those two files are important to access AWS via the 'aws' command. '''~/.aws/credentials''' This file contains your access key and secret key, and should not need to be modified after running 'aws configure'. Your same keys can be used to access any roles in any accounts you have access to. '''~/.aws/config''' This file contains some account information you will need to tweak. 
You may want to configure something like this: [default] region = us-west-2 [profile pangenomics-developer] source_profile = default role_arn = arn:aws:iam::422448306679:role/developer mfa_serial = arn:aws:iam::652235167018:mfa/bill@ucsc.edu duration_seconds = 43200 The "role_arn" line contains the role and account number you are accessing. You can see a list of live account numbers here: [[AWS Account List and Numbers]] Find the account number you need and enter it on the role_arn line, as well as the role name. You will get the role name from the cluster-admin group when you get access. The 'mfa_serial' line contains the identifier for your MFA device. It will always look like '''"arn:aws:iam::652235167018:mfa/[your_iam_username]"'''. The account number listed there will always be "652235167018" because that is the account number of the top level "gi-gateway" account. The "duration_seconds" parameter says that your session token will be 43200 seconds long (12 hours). That means you will only have to authenticate with MFA once every 12 hours. 12 hours is the maximum you can request, although you can specify less than that. This means it won't ask you for MFA every time you run a command for the next 12 hours. Once that is configured, you should be able to reference the profile you just created when using the aws command, like so: $ aws s3 ls --profile pangenomics-developer It will ask you for your MFA code and then run the command. Once you enter the MFA code, the token it creates will be valid for 12 hours if you specified "duration_seconds = 43200", or if you omitted that line, the default session duration is one hour, so you can run other 'aws' cli commands without the need to re-authenticate with MFA for the duration of the session. After the session expires, you will need to authenticate via MFA again. 
==Tag Your Resources== When you start using AWS resources (instances, networks, etc), it is very important that you "tag" your resources with the "Owner" tag (note the capital "O"). "Owner" is the key, and the value assigned to it will be your IAM username (i.e. your email address). So, for example, if I spin up an instance, I would tag it during or after creation with something like: Owner = bob@ucsc.edu If you do not tag your instances, '''they will automatically be terminated within 10 minutes.''' Tag your instances especially, but tag every resource you create! This allows us to perform accounting tasks much more easily and allows the Program Managers to know which resources are controlled by who. f46e9180a211296084ffb91c5cb5422b23ae02b2 175 157 2020-01-29T00:23:24Z Weiler 3 /* API Access and Secret Keys */ wikitext text/x-wiki __TOC__ == Getting AWS (Amazon Web Services) Access == The Genomics Institute has a series of AWS Accounts that all support different projects. Often if you become associated with one or more of those projects, you will need access to that account or accounts. The way we are managing AWS IAM Account Access is that we have one AWS account that is the 'top level' account that everyone gets access to, and then, once you log in there, you can "Switch Role" into another sub-account that you are running things in. To get access, you will need your PI or Project Manager to email cluster-admin (cluster-admin@soe.ucsc.edu) asking for an AWS account for you, and also in that email to name the projects you will have access to. The cluster-admin group will contact you with your credentials to login. Once you login, you can change your password if you want to and also you will be able to set up MFA (Multi Factor Authentication) for your account. You will be required to use MFA in order to "Switch Role" into any of the sub-accounts for the projects you are working on. The login URL to use when logging in to the top level account is listed below. 
The top level account is known as "gi-gateway": https://gi-gateway.signin.aws.amazon.com/console When you login, you '''may''' see a couple error messages on the AWS dashboard saying you don't have access to view certain resources - '''this is normal''', so just ignore the error messages. == Configuring Account Credentials == Once you login to the gi-gateway, you will have very few permissions to do anything there - which is normal, since you will not be working in that account anyway. The gi-gateway account is just there to authenticate you to AWS. '''Changing Your Password''' You can change your password by clicking on your username on the top right of the web browser window, just to the right of the little bell. If your username is bill@ucsc.edu, for example: * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window. * Click the '''"My Security Credentials"''' drop-down menu option. * Click the '''"Change Password"''' button to change your password. Note that we have a password strength policy in place, so your password must conform to the following requirements: * Your password must be at least 10 characters long * Your password must contain at least one lowercase letter * Your password must contain at least one non-alphanumeric character * Your password must contain at least one number You will also need to configure MFA on your account before you will be allowed to switch roles into another account. '''Configuring MFA''' To configure MFA (Multi Factor Authentication), the most common way to do it is to use '''Google Authenticator''', which is an app available for Apple and Android based cell phones and mobile devices. The app is free, simply download it from the app store to your cell phone or tablet to get started. Other MFA apps may also work but we have not tested everything out there. 
Once you have Google Authenticator installed, log into the gi-gateway account using the above URL, then: * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window (again, bill@ucsc.edu is an example). * Click the '''"My Security Credentials"''' drop-down menu option. * Scroll down to the MFA (Multi-Factor Authentication) section of the page, and click '''"Assign MFA Device"'''. * In the following menu select '''"Virtual MFA Device"'''. * In the following window click the '''"Show QR Code"''' link, and the MFA QR barcode will appear on your screen. * Open the Google Authenticator app on your mobile device, and click the little "+" symbol in the top right corner of the app to add an account. * You will then need to select "Scan Barcode" in the Google Authenticaor app to continue, and aim your mobile device camera at the QR barcode. * The new account MFA device should then be set up and you should see a 6 digit number with a small timer to the right of it. You must type in one 6 digit code that it displays into your web browser when asked, then wait for the next code to appear after the timer expires, and type that into the second field. It should then inform you that you have successfully associated an MFA device with your account. Once you have associated an MFA device with the 'gi-gateway' Account, '''log out''', then log back in. It will ask for your username and password, and then ask for your MFA code, which you can view by opening Google Authenticator and seeing what code it is displaying at that time. The code changes every 30 seconds or so. '''You must log out first and log back in using MFA in order to be able to switch roles!!!''' == Switching Roles into Another AWS Account == Now that you have configured a password and enabled MFA, you will be allowed to "Switch Roles" into another account such that you can begin work there. 
The first time you switch roles into an account it will ask you a few questions, but subsequently it will remember which roles you have access to and they will become a menu item you can click on to quickly switch roles. Let's assume that you want to switch to the 'pangenomics' AWS account, and you have been already granted access to do so by the cluster-admin group. After logging into the 'gi-gateway' account at the URL listed here (same as above): https://gi-gateway.signin.aws.amazon.com/console Do the following to switch roles into the 'pangenomics' account (as an example): * Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window (again, bill@ucsc.edu is an example). * Click the '''"Switch Role"''' option in the drop-down menu. * In the following menu it will ask you about the role you will be assuming. In our example we will use the following: -Account* = pangenomics -Role* = developer -Display Name = [leave blank, or use a short phrase] -Color = [choose a color for this role] * Then click the "Switch Role" button. If all went well you should be dumped into the 'pangenomics' account, and you should be identified in the top right hand corner of the page as '''"developer @ pangenomics"''', indicating your role and the account you are active in. You can then work as normal in that account. If you have not yet been given access to that role, you will receive an error message and not be allowed to switch roles. '''NOTE:''' When you switch roles, it may dump you into a region that you don't expect it to. Always verify the region you are in by looking at the top right of the web page - it will display your region there. Most of our stuff exists in "Oregon" (us-west-2), but some items appear in other regions on a per-case basis. 
If you wish to switch context back to the 'gi-gateway' account in order to manage something, or to switch to another role in another account, simple: * Click '''"developer @ pangenomics"''' in the top right corner of the window. * Select '''"Back to bill@ucsc.edu"''' You will then be sent back to the 'gi-gateway' context, and you can add another role to switch into, manage your credentials and further switch roles. == API Access and Secret Keys == If you require programmatic access to AWS, you will very likely be familiar with the AWS concept of Access Keys and Secret Keys, which can be used by scripts to authenticate yourself to AWS and use the APIs there without using the web console to authenticate. In the past, access keys and secret keys could be used by users with no further authentication. This introduces a security risk, as the management of those keys must be carefully guarded - if anyone gets your keys, they can rack up charges on your AWS account without your knowledge! Using the "Assume Role" mechanism we are now using, Access Keys and Secret Keys can still be created by users '''while logged into the gi-gateway account only'''. Do not try to create keys while you have "Switched Roles" into another account. Keys you create in the top level 'gi-gateway' account will work for you in any sub-account you have access to switch roles to. You will need to do a little more configuration for your keys to work from a UNIX command line however. It should be noted that we recommend awscli version 1.16.187 or later, as earlier versions have documented issues with using profiles and MFA related actions. You can determine your version of awscli by doing: aws --version Generically, if you plan on using keys for API Access, minimally you will need to configure the "aws" utility and then tweak the config a bit for our setup. To start, run "aws configure". 
It should look something like this: $ aws configure AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY Default region name [None]: us-west-2 Default output format [None]: Most folks do that to start. It creates two files: ~/.aws/config ~/.aws/credentials Those two files are important to access AWS via the 'aws' command. '''~/.aws/credentials''' This file contains your access key and secret key, and should not need to be modified after running 'aws configure'. Your same keys can be used to access any roles in any accounts you have access to. '''~/.aws/config''' This file contains some account information you will need to tweak. You may want to configure something like this: [default] region = us-west-2 [profile pangenomics-developer] source_profile = default role_arn = arn:aws:iam::422448306679:role/developer mfa_serial = arn:aws:iam::652235167018:mfa/bill@ucsc.edu duration_seconds = 43200 The "role_arn" line contains the role and account number you are accessing. You can see a list of live account numbers here: [[AWS Account List and Numbers]] Find the account number you need and enter it on the role_arn line, as well as the role name. You will get the role name from the cluster-admin group when you get access. The 'mfa_serial' line contains the identifier for your MFA device. It will always look like '''"arn:aws:iam::652235167018:mfa/[your_iam_username]"'''. The account number listed there will always be "652235167018" because that is the account number of the top level "gi-gateway" account. The "duration_seconds" parameter says that your session token will be 43200 seconds long (12 hours). That means you will only have to authenticate with MFA once every 12 hours. 12 hours is the maximum you can request, although you can specify less than that. This means it won't ask you for MFA every time you run a command for the next 12 hours. 
Once that is configured, you should be able to reference the profile you just created when using the aws command, like so: $ aws s3 ls --profile pangenomics-developer It will ask you for your MFA code and then run the command. Once you enter the MFA code, the token it creates will be valid for 12 hours if you specified "duration_seconds = 43200", or if you omitted that line, the default session duration is one hour, so you can run other 'aws' cli commands without the need to re-authenticate with MFA for the duration of the session. After the session expires, you will need to authenticate via MFA again. ==Tag Your Resources== When you start using AWS resources (instances, networks, etc), it is very important that you "tag" your resources with the "Owner" tag (note the capital "O"). "Owner" is the key, and the value assigned to it will be your IAM username (i.e. your email address). So, for example, if I spin up an instance, I would tag it during or after creation with something like: Owner = bob@ucsc.edu If you do not tag your instances, '''they will automatically be terminated within 10 minutes.''' Tag your instances especially, but tag every resource you create! This allows us to perform accounting tasks much more easily and allows the Program Managers to know which resources are controlled by who. d4cef684ad25a9e0af9425b79d020c8ece4fc600 176 175 2020-01-29T00:28:29Z Weiler 3 /* API Access and Secret Keys */ wikitext text/x-wiki __TOC__ == Getting AWS (Amazon Web Services) Access == The Genomics Institute has a series of AWS Accounts that all support different projects. Often if you become associated with one or more of those projects, you will need access to that account or accounts. The way we are managing AWS IAM Account Access is that we have one AWS account that is the 'top level' account that everyone gets access to, and then, once you log in there, you can "Switch Role" into another sub-account that you are running things in. 
To get access, you will need your PI or Project Manager to email cluster-admin (cluster-admin@soe.ucsc.edu) asking for an AWS account for you, naming in that email the projects you will have access to. The cluster-admin group will contact you with your login credentials. Once you log in, you can change your password if you wish, and you will be able to set up MFA (Multi-Factor Authentication) for your account. You will be required to use MFA in order to "Switch Role" into any of the sub-accounts for the projects you are working on.

The top level account is known as "gi-gateway", and its login URL is:

https://gi-gateway.signin.aws.amazon.com/console

When you log in, you '''may''' see a couple of error messages on the AWS dashboard saying you don't have access to view certain resources - '''this is normal''', so just ignore them.

== Configuring Account Credentials ==

Once you log in to gi-gateway, you will have very few permissions to do anything there - which is normal, since you will not be working in that account anyway. The gi-gateway account is just there to authenticate you to AWS.

'''Changing Your Password'''

You can change your password by clicking on your username at the top right of the browser window, just to the right of the little bell. If your username is bill@ucsc.edu, for example:

* Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window.
* Click the '''"My Security Credentials"''' drop-down menu option.
* Click the '''"Change Password"''' button to change your password.
Note that we have a password strength policy in place, so your password must conform to the following requirements:

* Your password must be at least 10 characters long
* Your password must contain at least one lowercase letter
* Your password must contain at least one non-alphanumeric character
* Your password must contain at least one number

You will also need to configure MFA on your account before you will be allowed to switch roles into another account.

'''Configuring MFA'''

The most common way to configure MFA (Multi-Factor Authentication) is with '''Google Authenticator''', a free app available for Apple and Android phones and tablets; simply download it from the app store to get started. Other MFA apps may also work, but we have not tested everything out there.

Once you have Google Authenticator installed, log into the gi-gateway account using the above URL, then:

* Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window (again, bill@ucsc.edu is an example).
* Click the '''"My Security Credentials"''' drop-down menu option.
* Scroll down to the MFA (Multi-Factor Authentication) section of the page, and click '''"Assign MFA Device"'''.
* In the following menu, select '''"Virtual MFA Device"'''.
* In the following window, click the '''"Show QR Code"''' link, and the MFA QR code will appear on your screen.
* Open the Google Authenticator app on your mobile device, and tap the "+" symbol in the top right corner of the app to add an account.
* Select "Scan Barcode" in the Google Authenticator app, and aim your device's camera at the QR code.
* The new MFA device should then be set up, and you should see a 6-digit number with a small timer to the right of it.
When asked, type the 6-digit code the app displays into your web browser, then wait for the next code to appear after the timer expires and type that into the second field. The page should then inform you that you have successfully associated an MFA device with your account.

Once you have associated an MFA device with the 'gi-gateway' account, '''log out''', then log back in. It will ask for your username and password, and then for your MFA code, which you can view by opening Google Authenticator and reading the code it is displaying at that time. The code changes every 30 seconds or so. '''You must log out and log back in using MFA before you will be able to switch roles!'''

== Switching Roles into Another AWS Account ==

Now that you have configured a password and enabled MFA, you will be allowed to "Switch Roles" into another account so that you can begin work there. The first time you switch roles into an account it will ask you a few questions; subsequently it will remember which roles you have access to, and they will appear as menu items you can click to quickly switch roles.

Let's assume that you want to switch to the 'pangenomics' AWS account, and that you have already been granted access to do so by the cluster-admin group. After logging into the 'gi-gateway' account at the URL listed here (same as above):

https://gi-gateway.signin.aws.amazon.com/console

Do the following to switch roles into the 'pangenomics' account (as an example):

* Click '''"bill@ucsc.edu @ gi-gateway"''' on the top right of your browser window (again, bill@ucsc.edu is an example).
* Click the '''"Switch Role"''' option in the drop-down menu.
* The following menu will ask you about the role you will be assuming. In our example we will use:
** Account* = pangenomics
** Role* = developer
** Display Name = [leave blank, or use a short phrase]
** Color = [choose a color for this role]
* Then click the '''"Switch Role"''' button.
If all went well, you should land in the 'pangenomics' account, and you should be identified in the top right corner of the page as '''"developer @ pangenomics"''', indicating your role and the account you are active in. You can then work as normal in that account. If you have not yet been given access to that role, you will receive an error message and not be allowed to switch roles.

'''NOTE:''' When you switch roles, it may place you in a region you don't expect. Always verify the region you are in by looking at the top right of the web page - it will display your region there. Most of our resources live in "Oregon" (us-west-2), but some items appear in other regions on a per-case basis.

If you wish to switch context back to the 'gi-gateway' account in order to manage something, or to switch to another role in another account, simply:

* Click '''"developer @ pangenomics"''' in the top right corner of the window.
* Select '''"Back to bill@ucsc.edu"'''

You will then be sent back to the 'gi-gateway' context, where you can add another role to switch into, manage your credentials, and switch roles again.

== API Access and Secret Keys ==

If you require programmatic access to AWS, you will very likely be familiar with the AWS concept of Access Keys and Secret Keys, which scripts can use to authenticate to AWS and call its APIs without going through the web console. In the past, access keys and secret keys could be used with no further authentication. This introduces a security risk, so those keys must be carefully guarded - if anyone gets your keys, they can rack up charges on your AWS account without your knowledge!

Under the "Assume Role" mechanism we now use, Access Keys and Secret Keys can still be created by users, but '''only while logged into the gi-gateway account'''. Do not try to create keys while you have "Switched Roles" into another account.
Keys you create in the top level 'gi-gateway' account will work for you in any sub-account you have access to switch roles into. You will, however, need to do a little more configuration for your keys to work from a UNIX command line.

To set up your access and secret keys for the first time (again, while logged into the 'gi-gateway' account only), follow these instructions. Once you log into the gi-gateway web interface, click on your username in the top right corner of the browser window, then click "My Security Credentials". On that screen you will see an "Access Keys" section with one key listed. Delete that key (using the "Delete" button on the right side of the key), then create a new key using the "Create Access Key" button. It will show you your access and secret key '''once''', so make sure to copy and paste them somewhere safe.

Note that we recommend awscli version 1.16.187 or later, as earlier versions have documented issues with profiles and MFA-related actions. You can determine your version of awscli with:

 aws --version

Generally, if you plan on using keys for API access, you will minimally need to configure the "aws" utility and then tweak the config a bit for our setup. To start, run "aws configure". It should look something like this (enter the access and secret keys that you created in the previous step):

 $ aws configure
 AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
 AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
 Default region name [None]: us-west-2
 Default output format [None]:

Most folks do that to start. It creates two files:

 ~/.aws/config
 ~/.aws/credentials

Those two files are what the 'aws' command uses to access AWS.

'''~/.aws/credentials'''

This file contains your access key and secret key, and should not need to be modified after running 'aws configure'. The same keys can be used to access any role in any account you have access to.
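Since the text above recommends awscli 1.16.187 or later, the version comparison can be sketched in shell. This is an illustrative sketch only: the installed version string is a placeholder here (in practice, take it from the output of "aws --version"), and it assumes GNU sort's <code>-V</code> version-sort option is available.

```shell
# Sketch: check an awscli version string against the recommended 1.16.187 minimum.
# INSTALLED is a placeholder value - substitute the version reported by `aws --version`.
MINIMUM="1.16.187"
INSTALLED="1.16.200"

# sort -V orders version strings numerically; if MINIMUM sorts first (or equal),
# the installed version meets the recommendation.
if [ "$(printf '%s\n%s\n' "$MINIMUM" "$INSTALLED" | sort -V | head -n1)" = "$MINIMUM" ]; then
    RESULT="ok"
else
    RESULT="too old"
fi
echo "$RESULT"
```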
'''~/.aws/config'''

This file contains some account information you will need to tweak. You may want to configure something like this:

 [default]
 region = us-west-2
 
 [profile pangenomics-developer]
 source_profile = default
 role_arn = arn:aws:iam::422448306679:role/developer
 mfa_serial = arn:aws:iam::652235167018:mfa/bill@ucsc.edu
 duration_seconds = 43200

The "role_arn" line contains the role and account number you are accessing. You can see a list of live account numbers here: [[AWS Account List and Numbers]]. Find the account number you need and enter it on the role_arn line, along with the role name. You will get the role name from the cluster-admin group when you are granted access.

The 'mfa_serial' line contains the identifier for your MFA device. It will always look like '''"arn:aws:iam::652235167018:mfa/[your_iam_username]"'''. The account number listed there will always be "652235167018", because that is the account number of the top level "gi-gateway" account.

The "duration_seconds" parameter sets your session token's lifetime to 43200 seconds (12 hours), meaning you only have to authenticate with MFA once every 12 hours. 12 hours is the maximum you can request, although you can specify less. This means it won't ask you for MFA every time you run a command for the next 12 hours.

Once that is configured, you should be able to reference the profile you just created when using the aws command, like so:

 $ aws s3 ls --profile pangenomics-developer

It will ask you for your MFA code and then run the command. Once you enter the MFA code, the token it creates will be valid for 12 hours if you specified "duration_seconds = 43200"; if you omitted that line, the default session duration is one hour. You can run other 'aws' CLI commands without re-authenticating with MFA for the duration of the session. After the session expires, you will need to authenticate via MFA again.
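The mfa_serial value described above follows a fixed pattern, so it can be assembled mechanically. A minimal sketch (the username here is a stand-in, just as bill@ucsc.edu is in the examples above; the gateway account number is the one given in the text):

```shell
# Sketch: build the mfa_serial ARN for your ~/.aws/config.
# 652235167018 is the gi-gateway account number (always the same, per the text above);
# IAM_USER is a placeholder - substitute your own IAM username.
GATEWAY_ACCOUNT="652235167018"
IAM_USER="bill@ucsc.edu"
MFA_SERIAL="arn:aws:iam::${GATEWAY_ACCOUNT}:mfa/${IAM_USER}"
echo "${MFA_SERIAL}"
```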
== Tag Your Resources ==

When you start using AWS resources (instances, networks, etc.), it is very important that you "tag" your resources with the "Owner" tag (note the capital "O"). "Owner" is the key, and the value assigned to it is your IAM username (i.e. your email address). So, for example, if I spin up an instance, I would tag it during or after creation with something like:

 Owner = bob@ucsc.edu

If you do not tag your instances, '''they will automatically be terminated within 10 minutes.''' Tag your instances especially, but tag every resource you create! This allows us to perform accounting tasks much more easily and lets the Program Managers know which resources are controlled by whom.

= AWS Account List and Numbers =

This is a list of our currently available AWS accounts and their account numbers:

 ucsc-bd2k : 862902209576
 ucsc-toil-dev : 318423852362
 ucsc-vg-dev : 781907127277
 ucsc-platform-dev : 719818754276
 comparative-genomics-dev : 162786355865
 nanopore-dev : 270442831226
 ucsc-cgp-production : 097093801910
 platform-sc : 122796619775
 anvil-dev : 608666466534
 gi-gateway : 652235167018
 pangenomics : 422448306679
 braingeneers : 443872533066
 ucsctreehouse : 238605363322
 ucsc-bisti-dev : 851631505710
 dockstore-dev : 635220370222
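The Owner tag described above can also be applied from the command line rather than the console. A sketch of the awscli invocation, shown here as a constructed command string rather than executed (the instance ID and profile name are placeholders, and running it for real requires valid credentials):

```shell
# Sketch: apply the required Owner tag to an instance via awscli.
# INSTANCE_ID and the profile name are hypothetical placeholders.
INSTANCE_ID="i-0123456789abcdef0"
OWNER="bob@ucsc.edu"
TAG_CMD="aws ec2 create-tags --resources ${INSTANCE_ID} --tags Key=Owner,Value=${OWNER} --profile pangenomics-developer"
# Shown without executing; in practice, run the command itself.
echo "${TAG_CMD}"
```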
= Requirement for users to get GI VPN access =

If you need VPN access to the Genomics Institute firewalled/secure area (aka the "Prism" or "CIRM" environment), please make an appointment with the GI SysAdmin team by emailing ''cluster-admin@soe.ucsc.edu'' requesting access.

There are several requirements for gaining access to the firewalled area - please complete all of them '''BEFORE''' coming to have the VPN software set up on your laptop. Use this checklist to make sure that you have completed all '''six''' requirements:

'''1'''. User info, your PI info and your PI's approval
'''2'''. NIH Public Security Refresher Course Certificate
'''3'''. Signed Genomics Institute VPN User Agreement
'''4'''. Signed NIH Genomic Data Sharing Policy Agreement
'''5'''. "eduroam" wireless network set up on your laptop
'''6'''. Appropriate OpenVPN software installed on your laptop

'''1''': Ask your PI or sponsor to email cluster-admin@soe.ucsc.edu requesting a VPN account for you. This email should include:

* Your name
* Your PI's name
* Your requested username (if your name is Jane Doe, then your username could be 'jdoe', for example)
* Your PI's approval for this access
* Any other access you need, such as a UNIX server account or access to OpenStack

'''2''': You must take the NIH Public Security Refresher Course online, then print out the Completion Certificate (which should have your name on it) at the end of the training and deliver it when you come to your appointment to install the VPN software. You must complete the course in a single continuous sitting in order to be able to print the certificate at the end:

https://irtsectraining.nih.gov/publicUser.aspx

Click on the "2018 Information Security, Counterintelligence, Privacy Awareness, Records Management Refresher" link to begin the course.
At the end you will be able to print out the completion certificate that should have your name on it.

'''3''': You need to print and sign the Genomics Institute VPN User Agreement and bring it with you to your VPN software installation appointment. It is located here for download: [[Media:GI_VPN_Policy.pdf]]

'''4''': Please print, read and sign the last page of the NIH Genomic Data Sharing Policy agreement, located here for download. Just staple the pages together. Please bring the signed document to your appointment. By signing the document you agree that you have read and understood the policies described therein and that you agree to abide by them: [[Media:NIH_GDS_Policy.pdf]]

'''5''': You will need access to the "eduroam" wireless network '''prior''' to your appointment. Other UCSC wireless networks such as "cruznet" will not work with our VPN software, so please make sure your laptop works with eduroam before coming to your appointment. Instructions on how to get on eduroam are detailed here: https://its.ucsc.edu/wireless/eduroam.html

When using the VPN software off-campus, it will usually work unless the wireless network you are on has restrictions preventing it from functioning. Some other universities have such restrictions (notably UCSF), but most other wireless networks and home networks should work fine.

'''6''': Before your appointment, please make sure you install the appropriate OpenVPN software on your laptop (running OS X, Windows or Ubuntu):

For Macs, please download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the Latest Stable version.

For Windows, please download and install the '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html.
Select ''openvpn-install-x.x.x-xxxx.exe''.

For Ubuntu, please install network-manager-openvpn by typing:

 sudo apt-get install network-manager-openvpn network-manager-openvpn-gnome

Please do NOT worry about how to configure the software at this point. We will help you to set it up at your appointment. We will correspond with you via email about when the appointment will be. The appointment can take up to 30 minutes per person depending on whether or not any issues come up during the software setup. If you show up for your appointment without one (or more) of the requirements outlined above, we will have to reschedule your appointment for a time after you have completed them.

'''PLEASE NOTE:''' Because of the overhead required in setting up VPN access, please only request access if you have an immediate need to work on data that exists behind the firewall. We have had a fair number of people request access and go through the setup but then never use it. In other words, please do not request access because "one day you might need it", but because you '''do''' actually need it!

'''ALSO NOTE:''' VPN accounts expire one year from the date of first gaining access. To renew for another year you will need your PI/sponsor to send us a note asking for renewal. 84c3ef0009a0cae7a00957660f179551609fd939 Genomics Institute Computing Information 0 6 167 143 2020-01-14T22:44:10Z Weiler 3 /* Kubernetes Information */ wikitext text/x-wiki Welcome to the Genomics Institute Computing Information Repository! Browse the topics below for help in the area you are curious about.
==Datacenter Migration==
*[[Public Genomics Institute Infrastructure Ready for Migration]]

==GI Public Computing Environment==
*[[How to access the public servers]]

==GI Firewalled Computing Environment (PRISM)==
*[[Access to the Firewalled Compute Servers]]

==VPN Access==
*[[Requirement for users to get GI VPN access]]

== NIH dbGaP Access Requirements ==
*[[Requirements for dbGaP Access]]

== giCloud Openstack ==
*[[Overview of giCloud in the Genomics Institute]]

== Amazon Web Services Account Management ==
*[[Overview of Getting and Using an AWS IAM Account]]
*[[AWS Account List and Numbers]]

==Kubernetes Information==
*[[Computational Genomics Kubernetes Installation]]
*[[Undiagnosed Diseaase Project Kubernetes Installation]]

d6b12baa2020029b3e8f1937189280c193fa3161 169 167 2020-01-14T22:51:12Z Weiler 3 /* Kubernetes Information */ wikitext text/x-wiki Welcome to the Genomics Institute Computing Information Repository! Browse the topics below for help in the area you are curious about.

==Datacenter Migration==
*[[Public Genomics Institute Infrastructure Ready for Migration]]

==GI Public Computing Environment==
*[[How to access the public servers]]

==GI Firewalled Computing Environment (PRISM)==
*[[Access to the Firewalled Compute Servers]]

==VPN Access==
*[[Requirement for users to get GI VPN access]]

== NIH dbGaP Access Requirements ==
*[[Requirements for dbGaP Access]]

== giCloud Openstack ==
*[[Overview of giCloud in the Genomics Institute]]

== Amazon Web Services Account Management ==
*[[Overview of Getting and Using an AWS IAM Account]]
*[[AWS Account List and Numbers]]

==Kubernetes Information==
*[[Computational Genomics Kubernetes Installation]]
*[[Undiagnosed Disease Project Kubernetes Installation]]

967c26d84e22d95f9c460e4c208c532ce808677c 174 169 2020-01-23T17:16:54Z Weiler 3 wikitext text/x-wiki Welcome to the Genomics Institute Computing Information Repository! Browse the topics below for help in the area you are curious about.
==GI Public Computing Environment==
*[[How to access the public servers]]

==GI Firewalled Computing Environment (PRISM)==
*[[Access to the Firewalled Compute Servers]]

==VPN Access==
*[[Requirement for users to get GI VPN access]]

== NIH dbGaP Access Requirements ==
*[[Requirements for dbGaP Access]]

== giCloud Openstack ==
*[[Overview of giCloud in the Genomics Institute]]

== Amazon Web Services Account Management ==
*[[Overview of Getting and Using an AWS IAM Account]]
*[[AWS Account List and Numbers]]

==Kubernetes Information==
*[[Computational Genomics Kubernetes Installation]]
*[[Undiagnosed Disease Project Kubernetes Installation]]

645705ee1a12937901e139de5305890bac22bd4d 181 174 2020-04-28T16:58:55Z Weiler 3 /* giCloud Openstack */ wikitext text/x-wiki Welcome to the Genomics Institute Computing Information Repository! Browse the topics below for help in the area you are curious about.

==GI Public Computing Environment==
*[[How to access the public servers]]

==GI Firewalled Computing Environment (PRISM)==
*[[Access to the Firewalled Compute Servers]]

==VPN Access==
*[[Requirement for users to get GI VPN access]]

== NIH dbGaP Access Requirements ==
*[[Requirements for dbGaP Access]]

== giCloud Openstack ==
*[[Overview of giCloud in the Genomics Institute]]
*[[Quick Start Instructions to Get Rolling with OpenStack]]

== Amazon Web Services Account Management ==
*[[Overview of Getting and Using an AWS IAM Account]]
*[[AWS Account List and Numbers]]

==Kubernetes Information==
*[[Computational Genomics Kubernetes Installation]]
*[[Undiagnosed Disease Project Kubernetes Installation]]

1c117b5185b4177c9449dd1b7265af5ba7e0474b Undiagnosed Diseaase Project Kubernetes Installation 0 24 168 2020-01-14T22:50:27Z Weiler 3 Created page with "__TOC__ The Undiagnosed Disease Project (UDP) has a Kubernetes Cluster running on one large GPU server. The current cluster makeup includes one worker node with the followin..."
wikitext text/x-wiki __TOC__ The Undiagnosed Disease Project (UDP) has a Kubernetes Cluster running on one large GPU server. The current cluster makeup includes one worker node with the following specs:

* 72 CPU cores (3.1 GHz)
* 384 GB RAM
* 3.2 TB Local NVMe Flash Storage
* 4 NVIDIA GPUs
* 25 Gb/s Network Interface

==Getting Authorized to Connect==

If you require access to this kubernetes cluster, contact Benedict Paten asking for permission to use it, then pass on that permission via email to: cluster-admin@soe.ucsc.edu

Let us know which group you are with and we can authorize you to use the cluster in the correct namespace.

'''NOTE:''' You need GI VPN Access to access this kubernetes installation.

==Authenticating to Kubernetes==

We will authorize (authz) you to use the cluster on the server side, but you will also need to authenticate (authn) using your '@ucsc.edu' email address and a unique JSON Web Token. These credentials are installed in ~/.kube/config on whatever machine you are connecting from to reach the cluster. To authenticate and get your base kubernetes configuration, go to the URL below, which will ask you to authenticate to Google. Use your '@ucsc.edu' email address as the login. It will then ask you to authenticate via CruzID Gold if your web browser doesn't already have the authentication token cached:

https://udp-kube-auth.gi.ucsc.edu

Once you authenticate (via username/password and 2-factor auth for CruzID Gold), it will pass you back to the 'https://udp-kube-auth.gi.ucsc.edu' website, which should confirm authentication at the top with a message saying "Successfully Authenticated". '''If you see any errors in red,''' but are sure you typed in your password and 2-factor auth correctly, click on the above link again (https://udp-kube-auth.gi.ucsc.edu) and authenticate a second time, which should work. There is a quirk where the web token doesn't always pass back to us correctly on the first try.
Upon success, you will be able to click the blue "Download Config File" button, which downloads your initial kubernetes config file. Copy this file to your home directory as ~/.kube/config. Follow the directions on the web page to insert your '''"namespace:"''' line as directed. We will let you know which namespace to use.

==Testing Connectivity==

Once your ~/.kube/config file is set up correctly, you should be able to connect to the cluster. All our shared servers here at the Genomics Institute have the 'kubectl' command installed on them, but if you are coming from somewhere else, just make sure the "kubectl" utility is installed on that machine. A quick test should go as follows:

 $ kubectl get nodes
 NAME          STATUS   ROLES    AGE   VERSION
 k1.kube       Ready    <none>   13h   v1.15.3
 k2.kube       Ready    <none>   13h   v1.15.3
 k3.kube       Ready    <none>   13h   v1.15.3
 master.kube   Ready    master   13h   v1.15.3

==Running Pods and Jobs with Requests and Limits==

When running jobs and pods on kubernetes, you will always want to specify "requests" and "limits" on resources, otherwise your pods will get stuck with the default limits, which are tiny (to protect against runaway pods). You should always have an idea of how many resources your jobs will consume, and not use much more than that, in order not to hog all the resources. It also prevents your job from "running away" unexpectedly and chewing up more resources than expected. Here is a good example of a job file that specifies limits:

job.yml

 apiVersion: batch/v1
 kind: Job
 metadata:
   name: $USER-$TS
 spec:
   backoffLimit: 0
   ttlSecondsAfterFinished: 30
   template:
     spec:
       containers:
       - name: magic
         image: robcurrie/ubuntu
         imagePullPolicy: Always
         resources:
           requests:
             cpu: "1"
             memory: "2G"
             ephemeral-storage: "2G"
           limits:
             cpu: "1"
             memory: "2G"
             ephemeral-storage: "3G"
         command: ["/bin/bash", "-c"]
         args: ['for i in {1..100}; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done']
       restartPolicy: Never

Please note that the "request" and "limit" item fields should be the same.
You would think that you could set the limit higher than the request, but in reality they need to match in order for the pod to stay within the kubernetes resource limit bubble. If you set the limit higher than the request, then you are risking the pod using more memory than the scheduler expects, and the node can start killing off random other colocated pods by way of OOM, which is very, very bad for the cluster.

'''NOTE:''' Jobs and pods that have '''completed''' over 72 hours ago but have not been cleaned up will be automatically removed by the garbage collector. Most jobs will have the "ttlSecondsAfterFinished" configuration item in them, so they will automatically be cleaned up after that time expires, but leaving old pods and jobs around pins the disk space they were using while they remain, so it's good to get rid of them as soon as they are done unless you are debugging a failure or something like that. Jobs that '''run''' over 72 hours will not be deleted, only the ones that have '''exited''' over 72 hours ago.

A lot of other good information can be viewed on Rob Currie's github page, which includes examples and some "How To" documentation: https://github.com/rcurrie/kubernetes

==View the Cluster's Current Activity==

One quick way to check the cluster's utilization is to do:

 kubectl top nodes
 NAME             CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
 udp-k8s-1        1815m        1%     1191Mi          0%
 udp-k8s-master   111m         5%     1024Mi          46%

Ignore the master node as that one only handles cluster management and doesn't run jobs or pods for users.
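The same check works per pod, which is handy for confirming that your own jobs are staying inside the requests you set. A sketch — the namespace and node names below are placeholders; use the ones the admins assign you:

```shell
# CPU/memory currently consumed by the pods in your namespace (replace 'mylab')
kubectl top pods -n mylab

# What the scheduler has reserved on the worker node; the
# "Allocated resources" section lists total requests vs. limits
kubectl describe node udp-k8s-1
```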
Another good way to get a lot of details about the current state of the cluster is through the Kubernetes Dashboard: https://cgl-k8s-dashboard.gi.ucsc.edu/ Select the "token" login method, and paste in this (long) token: eyJhbGciOiJSUzI1NiIsImtpZCI6IiJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJrdWJlcm5ldGVzLWRhc2hib2FyZC10b2tlbi0ycDY4cCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6ImE5ZGI2Y2I0LWMyYWYtNDM3My04ZmM2LWE4YWYwYTBmNGRkNCIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDprdWJlcm5ldGVzLWRhc2hib2FyZDprdWJlcm5ldGVzLWRhc2hib2FyZCJ9.ZVQjryG_ksfvReIfq4Frb6M4sE6OVDOXnFy9Aii-h3mrpdHRE6bgjdAvSGZ0jJSIUEz5GgPBQ0lCwhyZocivHHr4zTrNxMkOFZhPDnpvF6RVIDWTkqmH9Dg6qmro0gTJP75oKBpt7dFN2pW4zvqOAzqPmh7qxfoVusN8X6U13YirMFEf65-aGL-_FFNBsEzvjkC-BgXWbtk3YZc8CJL7xtvlKLyE6u6jC9Qx0SWnwzkALlxmzo_yYTDKpIrWiQGEqzLQOxKml-H0kSYLDX-t4sTivXp4vCw_ruoqwIpLnnQAC7q3ZtSTxHIrxbB7n_M8gfhpXtwprbPav-XmBk1xaQ The dashboard is read-only, so you won't be able to edit anything, it's mostly for seeing what's going on and where. 35b4d6f38a88cc371e8933e24f763396e22b36a9 Undiagnosed Disease Project Kubernetes Installation 0 25 170 2020-01-14T22:51:33Z Weiler 3 Created page with "__TOC__ The Undiagnosed Disease Project (UDP) has a Kubernetes Cluster running on one large GPU server. The current cluster makeup includes one worker node with the followin..." wikitext text/x-wiki __TOC__ The Undiagnosed Disease Project (UDP) has a Kubernetes Cluster running on one large GPU server. 
The current cluster makeup includes one worker node with the following specs:

* 72 CPU cores (3.1 GHz)
* 384 GB RAM
* 3.2 TB Local NVMe Flash Storage
* 4 NVIDIA GPUs
* 25 Gb/s Network Interface

==Getting Authorized to Connect==

If you require access to this kubernetes cluster, contact Benedict Paten asking for permission to use it, then pass on that permission via email to: cluster-admin@soe.ucsc.edu

Let us know which group you are with and we can authorize you to use the cluster in the correct namespace.

'''NOTE:''' You need GI VPN Access to access this kubernetes installation.

==Authenticating to Kubernetes==

We will authorize (authz) you to use the cluster on the server side, but you will also need to authenticate (authn) using your '@ucsc.edu' email address and a unique JSON Web Token. These credentials are installed in ~/.kube/config on whatever machine you are connecting from to reach the cluster. To authenticate and get your base kubernetes configuration, go to the URL below, which will ask you to authenticate to Google. Use your '@ucsc.edu' email address as the login. It will then ask you to authenticate via CruzID Gold if your web browser doesn't already have the authentication token cached:

https://udp-kube-auth.gi.ucsc.edu

Once you authenticate (via username/password and 2-factor auth for CruzID Gold), it will pass you back to the 'https://udp-kube-auth.gi.ucsc.edu' website, which should confirm authentication at the top with a message saying "Successfully Authenticated". '''If you see any errors in red,''' but are sure you typed in your password and 2-factor auth correctly, click on the above link again (https://udp-kube-auth.gi.ucsc.edu) and authenticate a second time, which should work. There is a quirk where the web token doesn't always pass back to us correctly on the first try. Upon success, you will be able to click the blue "Download Config File" button, which downloads your initial kubernetes config file.
Copy this file to your home directory as ~/.kube/config. Follow the directions on the web page to insert your '''"namespace:"''' line as directed. We will let you know which namespace to use.

==Testing Connectivity==

Once your ~/.kube/config file is set up correctly, you should be able to connect to the cluster. All our shared servers here at the Genomics Institute have the 'kubectl' command installed on them, but if you are coming from somewhere else, just make sure the "kubectl" utility is installed on that machine. A quick test should go as follows:

 $ kubectl get nodes
 NAME          STATUS   ROLES    AGE   VERSION
 k1.kube       Ready    <none>   13h   v1.15.3
 k2.kube       Ready    <none>   13h   v1.15.3
 k3.kube       Ready    <none>   13h   v1.15.3
 master.kube   Ready    master   13h   v1.15.3

==Running Pods and Jobs with Requests and Limits==

When running jobs and pods on kubernetes, you will always want to specify "requests" and "limits" on resources, otherwise your pods will get stuck with the default limits, which are tiny (to protect against runaway pods). You should always have an idea of how many resources your jobs will consume, and not use much more than that, in order not to hog all the resources. It also prevents your job from "running away" unexpectedly and chewing up more resources than expected. Here is a good example of a job file that specifies limits:

job.yml

 apiVersion: batch/v1
 kind: Job
 metadata:
   name: $USER-$TS
 spec:
   backoffLimit: 0
   ttlSecondsAfterFinished: 30
   template:
     spec:
       containers:
       - name: magic
         image: robcurrie/ubuntu
         imagePullPolicy: Always
         resources:
           requests:
             cpu: "1"
             memory: "2G"
             ephemeral-storage: "2G"
           limits:
             cpu: "1"
             memory: "2G"
             ephemeral-storage: "3G"
         command: ["/bin/bash", "-c"]
         args: ['for i in {1..100}; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done']
       restartPolicy: Never

Please note that the "request" and "limit" item fields should be the same.
You would think that you could set the limit higher than the request, but in reality they need to match in order for the pod to stay within the kubernetes resource limit bubble. If you set the limit higher than the request, then you are risking the pod using more memory than the scheduler expects, and the node can start killing off random other colocated pods by way of OOM, which is very, very bad for the cluster.

'''NOTE:''' Jobs and pods that have '''completed''' over 72 hours ago but have not been cleaned up will be automatically removed by the garbage collector. Most jobs will have the "ttlSecondsAfterFinished" configuration item in them, so they will automatically be cleaned up after that time expires, but leaving old pods and jobs around pins the disk space they were using while they remain, so it's good to get rid of them as soon as they are done unless you are debugging a failure or something like that. Jobs that '''run''' over 72 hours will not be deleted, only the ones that have '''exited''' over 72 hours ago.

A lot of other good information can be viewed on Rob Currie's github page, which includes examples and some "How To" documentation: https://github.com/rcurrie/kubernetes

==View the Cluster's Current Activity==

One quick way to check the cluster's utilization is to do:

 kubectl top nodes
 NAME             CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
 udp-k8s-1        1815m        1%     1191Mi          0%
 udp-k8s-master   111m         5%     1024Mi          46%

Ignore the master node as that one only handles cluster management and doesn't run jobs or pods for users.
Another good way to get a lot of details about the current state of the cluster is through the Kubernetes Dashboard: https://cgl-k8s-dashboard.gi.ucsc.edu/ Select the "token" login method, and paste in this (long) token: eyJhbGciOiJSUzI1NiIsImtpZCI6IiJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJrdWJlcm5ldGVzLWRhc2hib2FyZC10b2tlbi0ycDY4cCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6ImE5ZGI2Y2I0LWMyYWYtNDM3My04ZmM2LWE4YWYwYTBmNGRkNCIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDprdWJlcm5ldGVzLWRhc2hib2FyZDprdWJlcm5ldGVzLWRhc2hib2FyZCJ9.ZVQjryG_ksfvReIfq4Frb6M4sE6OVDOXnFy9Aii-h3mrpdHRE6bgjdAvSGZ0jJSIUEz5GgPBQ0lCwhyZocivHHr4zTrNxMkOFZhPDnpvF6RVIDWTkqmH9Dg6qmro0gTJP75oKBpt7dFN2pW4zvqOAzqPmh7qxfoVusN8X6U13YirMFEf65-aGL-_FFNBsEzvjkC-BgXWbtk3YZc8CJL7xtvlKLyE6u6jC9Qx0SWnwzkALlxmzo_yYTDKpIrWiQGEqzLQOxKml-H0kSYLDX-t4sTivXp4vCw_ruoqwIpLnnQAC7q3ZtSTxHIrxbB7n_M8gfhpXtwprbPav-XmBk1xaQ The dashboard is read-only, so you won't be able to edit anything, it's mostly for seeing what's going on and where. 35b4d6f38a88cc371e8933e24f763396e22b36a9 171 170 2020-01-14T23:08:29Z Weiler 3 wikitext text/x-wiki __TOC__ The Undiagnosed Disease Project (UDP) has a Kubernetes Cluster running on one large GPU server. The current cluster makeup includes one worker node with the following specs: * 72 CPU cores (3.1 GHz) * 384 GB RAM * 3.2 TB Local NVMe Flash Storage * 4 NVIDIA GPUs * 25 Gb/s Network Interface ==Getting Authorized to Connect== If you require access to this kubernetes cluster, contact Benedict Paten asking for permission to use it, then pass on that permission via email to: cluster-admin@soe.ucsc.edu Let us know which group you are with and we can authorize you to use the cluster in the correct namespace. 
'''NOTE:''' You need GI VPN Access to access this kubernetes installation.

==Authenticating to Kubernetes==

We will authorize (authz) you to use the cluster on the server side, but you will also need to authenticate (authn) using your '@ucsc.edu' email address and a unique JSON Web Token. These credentials are installed in ~/.kube/config on whatever machine you are connecting from to reach the cluster. To authenticate and get your base kubernetes configuration, go to the URL below, which will ask you to authenticate to Google. Use your '@ucsc.edu' email address as the login. It will then ask you to authenticate via CruzID Gold if your web browser doesn't already have the authentication token cached:

https://udp-kube-auth.gi.ucsc.edu

Once you authenticate (via username/password and 2-factor auth for CruzID Gold), it will pass you back to the 'https://udp-kube-auth.gi.ucsc.edu' website, which should confirm authentication at the top with a message saying "Successfully Authenticated". '''If you see any errors in red,''' but are sure you typed in your password and 2-factor auth correctly, click on the above link again (https://udp-kube-auth.gi.ucsc.edu) and authenticate a second time, which should work. There is a quirk where the web token doesn't always pass back to us correctly on the first try.

Upon success, you will be able to click the blue "Download Config File" button, which downloads your initial kubernetes config file. Copy this file to your home directory as ~/.kube/config. Follow the directions on the web page to insert your '''"namespace:"''' line as directed. We will let you know which namespace to use.

==Testing Connectivity==

Once your ~/.kube/config file is set up correctly, you should be able to connect to the cluster. All our shared servers here at the Genomics Institute have the 'kubectl' command installed on them, but if you are coming from somewhere else, just make sure the "kubectl" utility is installed on that machine.
A quick test should go as follows:

 $ kubectl get nodes
 NAME          STATUS   ROLES    AGE   VERSION
 k1.kube       Ready    <none>   13h   v1.15.3
 k2.kube       Ready    <none>   13h   v1.15.3
 k3.kube       Ready    <none>   13h   v1.15.3
 master.kube   Ready    master   13h   v1.15.3

==Running Pods and Jobs with Requests and Limits==

When running jobs and pods on kubernetes, you will always want to specify "requests" and "limits" on resources, otherwise your pods will get stuck with the default limits, which are tiny (to protect against runaway pods). You should always have an idea of how many resources your jobs will consume, and not use much more than that, in order not to hog all the resources. It also prevents your job from "running away" unexpectedly and chewing up more resources than expected. Here is a good example of a job file that specifies limits:

job.yml

 apiVersion: batch/v1
 kind: Job
 metadata:
   name: $USER-$TS
 spec:
   backoffLimit: 0
   ttlSecondsAfterFinished: 30
   template:
     spec:
       containers:
       - name: magic
         image: robcurrie/ubuntu
         imagePullPolicy: Always
         resources:
           requests:
             cpu: "1"
             memory: "2G"
             ephemeral-storage: "2G"
           limits:
             cpu: "1"
             memory: "2G"
             ephemeral-storage: "3G"
         command: ["/bin/bash", "-c"]
         args: ['for i in {1..100}; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done']
       restartPolicy: Never

Please note that the "request" and "limit" item fields should be the same. You would think that you could set the limit higher than the request, but in reality they need to match in order for the pod to stay within the kubernetes resource limit bubble. If you set the limit higher than the request, then you are risking the pod using more memory than the scheduler expects, and the node can start killing off random other colocated pods by way of OOM, which is very, very bad for the cluster.

'''NOTE:''' Jobs and pods that have '''completed''' over 72 hours ago but have not been cleaned up will be automatically removed by the garbage collector.
Most jobs will have the "ttlSecondsAfterFinished" configuration item in them, so they will automatically be cleaned up after that time expires, but leaving old pods and jobs around pins the disk space they were using while they remain, so it's good to get rid of them as soon as they are done unless you are debugging a failure or something like that. Jobs that '''run''' over 72 hours will not be deleted, only the ones that have '''exited''' over 72 hours ago.

A lot of other good information can be viewed on Rob Currie's github page, which includes examples and some "How To" documentation: https://github.com/rcurrie/kubernetes

==View the Cluster's Current Activity==

One quick way to check the cluster's utilization is to do:

 kubectl top nodes
 NAME             CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
 udp-k8s-1        1815m        1%     1191Mi          0%
 udp-k8s-master   111m         5%     1024Mi          46%

Ignore the master node as that one only handles cluster management and doesn't run jobs or pods for users. Another good way to get a lot of details about the current state of the cluster is through the Kubernetes Dashboard:

https://cgl-k8s-dashboard.gi.ucsc.edu/

Select the "token" login method, and paste in this (long) token:
eyJhbGciOiJSUzI1NiIsImtpZCI6ImY3MmxpdUQzb1dSUUp0NlFVUlI1blJVaUItX1pUX2JJMGRhY0ZOc3B6MDQifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJrdWJlcm5ldGVzLWRhc2hib2FyZC10b2tlbi14OG52ZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6IjAwZDQ1ZTI1LWU5MmMtNDMxZC1iYjE2LWQ3ZWViZTRkNDgzMiIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDprdWJlcm5ldGVzLWRhc2hib2FyZDprdWJlcm5ldGVzLWRhc2hib2FyZCJ9.Ty9vSYU59_uqBar9OSc0wlenhGm1-aSUCPb8SZf6nhE8VVSt4TPCrXeL2SEsI_u6JAEeOBVJvVof52XSoU84RM8-e3ZWmr57LfjlEh5tPyJXPijCR_x3K0fXV-vpUUV69s7PHoLIy8UaoXOGbxm0O_731fnMenNtNbDDiWXjW9mXhklUG9mxDEipfKW76B_ZmuEkYuAP6BiNPuYc1K6x3m5x4QpkLe3MhBi0tTCkG5q1RU8S63FE3deRcl7VVvGoCENPq9vMJpOEqsVDBotEGwOca4UG7cCOeSSHwOz2aLHeP0CXZehWp9d7GggnrnknKJHrtXZ7-WiIABSPe2GJow The dashboard is read-only, so you won't be able to edit anything, it's mostly for seeing what's going on and where. f68c14020cb59112a94e69a814b554804585416d 172 171 2020-01-14T23:50:11Z Weiler 3 /* View the Cluster's Current Activity */ wikitext text/x-wiki __TOC__ The Undiagnosed Disease Project (UDP) has a Kubernetes Cluster running on one large GPU server. The current cluster makeup includes one worker node with the following specs: * 72 CPU cores (3.1 GHz) * 384 GB RAM * 3.2 TB Local NVMe Flash Storage * 4 NVIDIA GPUs * 25 Gb/s Network Interface ==Getting Authorized to Connect== If you require access to this kubernetes cluster, contact Benedict Paten asking for permission to use it, then pass on that permission via email to: cluster-admin@soe.ucsc.edu Let us know which group you are with and we can authorize you to use the cluster in the correct namespace. '''NOTE:''' You need GI VPN Access to access this kubernetes installation. 
==Authenticating to Kubernetes==

We will authorize (authz) you to use the cluster on the server side, but you will also need to authenticate (authn) using your '@ucsc.edu' email address and a unique JSON Web Token. These credentials are installed in ~/.kube/config on whatever machine you are connecting from to reach the cluster. To authenticate and get your base kubernetes configuration, go to the URL below, which will ask you to authenticate to Google. Use your '@ucsc.edu' email address as the login. It will then ask you to authenticate via CruzID Gold if your web browser doesn't already have the authentication token cached:

https://udp-kube-auth.gi.ucsc.edu

Once you authenticate (via username/password and 2-factor auth for CruzID Gold), it will pass you back to the 'https://udp-kube-auth.gi.ucsc.edu' website, which should confirm authentication at the top with a message saying "Successfully Authenticated". '''If you see any errors in red,''' but are sure you typed in your password and 2-factor auth correctly, click on the above link again (https://udp-kube-auth.gi.ucsc.edu) and authenticate a second time, which should work. There is a quirk where the web token doesn't always pass back to us correctly on the first try.

Upon success, you will be able to click the blue "Download Config File" button, which downloads your initial kubernetes config file. Copy this file to your home directory as ~/.kube/config. Follow the directions on the web page to insert your '''"namespace:"''' line as directed. We will let you know which namespace to use.

==Testing Connectivity==

Once your ~/.kube/config file is set up correctly, you should be able to connect to the cluster. All our shared servers here at the Genomics Institute have the 'kubectl' command installed on them, but if you are coming from somewhere else, just make sure the "kubectl" utility is installed on that machine.
A quick test should go as follows:

 $ kubectl get nodes
 NAME          STATUS   ROLES    AGE   VERSION
 k1.kube       Ready    <none>   13h   v1.15.3
 k2.kube       Ready    <none>   13h   v1.15.3
 k3.kube       Ready    <none>   13h   v1.15.3
 master.kube   Ready    master   13h   v1.15.3

==Running Pods and Jobs with Requests and Limits==

When running jobs and pods on kubernetes, you will always want to specify "requests" and "limits" on resources, otherwise your pods will get stuck with the default limits, which are tiny (to protect against runaway pods). You should always have an idea of how many resources your jobs will consume, and not use much more than that, in order not to hog all the resources. It also prevents your job from "running away" unexpectedly and chewing up more resources than expected. Here is a good example of a job file that specifies limits:

job.yml

 apiVersion: batch/v1
 kind: Job
 metadata:
   name: $USER-$TS
 spec:
   backoffLimit: 0
   ttlSecondsAfterFinished: 30
   template:
     spec:
       containers:
       - name: magic
         image: robcurrie/ubuntu
         imagePullPolicy: Always
         resources:
           requests:
             cpu: "1"
             memory: "2G"
             ephemeral-storage: "2G"
           limits:
             cpu: "1"
             memory: "2G"
             ephemeral-storage: "3G"
         command: ["/bin/bash", "-c"]
         args: ['for i in {1..100}; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done']
       restartPolicy: Never

Please note that the "request" and "limit" item fields should be the same. You would think that you could set the limit higher than the request, but in reality they need to match in order for the pod to stay within the kubernetes resource limit bubble. If you set the limit higher than the request, then you are risking the pod using more memory than the scheduler expects, and the node can start killing off random other colocated pods by way of OOM, which is very, very bad for the cluster.

'''NOTE:''' Jobs and pods that have '''completed''' over 72 hours ago but have not been cleaned up will be automatically removed by the garbage collector.
Most jobs will have the "ttlSecondsAfterFinished" configuration item in them, so they will automatically be cleaned up after that time expires; but leaving old pods and jobs around pins the disk space they were using for as long as they remain, so it's good to get rid of them as soon as they are done unless you are debugging a failure or something like that. Jobs that '''run''' over 72 hours will not be deleted; only the ones that have '''exited''' over 72 hours ago.

A lot of other good information, including examples and some "How To" documentation, can be found on Rob Currie's GitHub page: https://github.com/rcurrie/kubernetes

==View the Cluster's Current Activity==

One quick way to check the cluster's utilization is:

 $ kubectl top nodes
 NAME             CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
 udp-k8s-1        1815m        1%     1191Mi          0%
 udp-k8s-master   111m         5%     1024Mi          46%

Ignore the master node, as it only handles cluster management and doesn't run jobs or pods for users.

Another good way to get details about the current state of the cluster is through the Kubernetes Dashboard: https://udp-k8s-dashboard.gi.ucsc.edu/ Note that you need to be connected to the VPN for that URL to work.
Select the "token" login method, and paste in this (long) token: eyJhbGciOiJSUzI1NiIsImtpZCI6ImY3MmxpdUQzb1dSUUp0NlFVUlI1blJVaUItX1pUX2JJMGRhY0ZOc3B6MDQifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJrdWJlcm5ldGVzLWRhc2hib2FyZC10b2tlbi14OG52ZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6IjAwZDQ1ZTI1LWU5MmMtNDMxZC1iYjE2LWQ3ZWViZTRkNDgzMiIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDprdWJlcm5ldGVzLWRhc2hib2FyZDprdWJlcm5ldGVzLWRhc2hib2FyZCJ9.Ty9vSYU59_uqBar9OSc0wlenhGm1-aSUCPb8SZf6nhE8VVSt4TPCrXeL2SEsI_u6JAEeOBVJvVof52XSoU84RM8-e3ZWmr57LfjlEh5tPyJXPijCR_x3K0fXV-vpUUV69s7PHoLIy8UaoXOGbxm0O_731fnMenNtNbDDiWXjW9mXhklUG9mxDEipfKW76B_ZmuEkYuAP6BiNPuYc1K6x3m5x4QpkLe3MhBi0tTCkG5q1RU8S63FE3deRcl7VVvGoCENPq9vMJpOEqsVDBotEGwOca4UG7cCOeSSHwOz2aLHeP0CXZehWp9d7GggnrnknKJHrtXZ7-WiIABSPe2GJow The dashboard is read-only, so you won't be able to edit anything, it's mostly for seeing what's going on and where. c6a8f24c0c9fb11a0931cdd0bb43eb37d329e74c Quick Start Instructions to Get Rolling with OpenStack 0 26 182 2020-04-28T17:01:51Z Weiler 3 Created page with "__TOC__ ==Request an OpenStack Account== Once you have PRISM/GI VPN access, you can request an OpenStack account. You will need to send an email to cluster-admin@soe.ucsc.e..." wikitext text/x-wiki __TOC__ ==Request an OpenStack Account== Once you have PRISM/GI VPN access, you can request an OpenStack account. You will need to send an email to cluster-admin@soe.ucsc.edu asking for access, and let us know which lab you are in, or who your PI is, so we can place you in the right OpenStack group. 
==Create a SSH Public/Private Keypair==

To log into an OpenStack VM instance, you will need a SSH public key. The key is "injected" into the instance upon creation, and only someone with that key (i.e. you) will be able to log in via SSH initially. If you already have a SSH public and private key that you use elsewhere, you can use that one, and can skip to the next step.
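As an aside, ssh-keygen can also be run fully non-interactively, which is handy in scripts. A sketch using a throwaway path (for real use, keep the default ~/.ssh/id_rsa so the later steps match):

```shell
# Non-interactive keypair generation: -N '' sets an empty passphrase,
# -f picks the output file, -q suppresses the banner. The /tmp path is
# a throwaway example, not where your real key should live.
rm -f /tmp/demo_key /tmp/demo_key.pub   # avoid an overwrite prompt
ssh-keygen -t rsa -N '' -f /tmp/demo_key -q
cat /tmp/demo_key.pub                   # the public half, which you upload
```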
If you don't have a SSH keypair set up yet, then you will need to log into the UNIX compatible machine you will be '''logging in from''' (a Mac/Apple computer will also work), and then run the 'ssh-keygen' command. If you are behind the VPN, you can first log into mustard, crimson or razzmatazz, which are linux servers. The command will look something like this: $ ssh-keygen -t rsa Generating public/private rsa key pair. Enter file in which to save the key (/public/home/frank/.ssh/id_rsa): Created directory '/public/home/frank/.ssh'. Enter passphrase (empty for no passphrase): [JUST HIT ENTER] Enter same passphrase again: [JUST HIT ENTER] Your identification has been saved in /public/home/frank/.ssh/id_rsa. Your public key has been saved in /public/home/frank/.ssh/id_rsa.pub. The key fingerprint is: SHA256:dhJG1A3gcwj7Mz17ommt3NIczMVVgrzp8Tf6F1X4jpI The key's randomart image is: +---[RSA 2048]----+ | ..+o.+ ..o.| | = .. + o..| | . * .. + ..| | o = * o| | So+o + o.| | . =+oE ooo| | +o.....o| | .o++o . .| | .=o. ...| +----[SHA256]-----+ You will then have a new directory, "~/.ssh", and inside that directory you will have a file called "id_rsa.pub". That is your SSH public key. You will need this in the next step in order to set up your key in OpenStack. ==Log In To giCloud== Once you have been notified that your account has been set up and have been given login credentials, connect to the VPN and then go to this link, which is the login page: https://gicloud.prism To login, enter your username and password. Also you will see a "Domain" field, just enter the word "default" for the domain. Click "Log In". You will be logged into your group's summary page. ==Upload your SSH Public Key== After creating your new key in the above "Create a SSH Public/Private Keypair" step, you will need to upload that key into OpenStack. Once you are logged in, on the left hand navigation menu, click "Project", then in the submenu, select "Compute", and finally select "Key Pairs". 
It should take you to the "Key Pairs" window as shown here. [[File:Keypairs.png|900px]] Next click the "Import Public Key" button on the top right of the window. In the resulting window, name your key in the "Key Pair Name" field. Name it something descriptive like "laptop-key" if the key is on your laptop, or "mustard-key" if you are logged into mustard, etc. To get your key, open a terminal window and type "cat ~/.ssh/id_rsa.pub" to get your full key, as so: $ cat ~/.ssh/id_rsa.pub ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAyVKNfdBbDIk7Iq8JmL+u3vxAn4M1iaQgMU5tHJhMSAYBZEZRLZAFc+Qovxe5zzs1ixte9lCipLep39q2I4U8XND17nYliZ4HVM4MW4GsMUfKsgX2FI3mB2vAQ9pZSLkAhTg2D+92uALUSSv1cDZhTqo7DuPRX2Upxyd5QbRL6TRFswBjHz2vY/JpaPQm1S1d10mokPpmxehLfwp0mVgmz1Uv/6FflqiZ68DhDN67cs1yQgWYXQ01IHPjzTKRwCuZVkgT99rkoqy6TkAyrvsfzYPZbIA2y+ovOBzq6WCUT9gp5Jx/UE6CxLSmAuGPAJkV5D/twKIe75xc+5jdi3I1cgKw== user@laptop Copy that whole line, starting with "ssh-rsa" all the way through the very last character, including the "user@laptop" bit (which may be different for you, just be sure to include it in the line copy). Then back in the OpenStack Key Pair dialogue window, paste in the keypair in the "Public Key" window, then click "Import Key". The key should then appear in the key list. c7b0a31ba919c1f6f3e1331251f38da00eaf6109 198 197 2020-04-28T20:44:53Z Weiler 3 /* Log In To giCloud */ wikitext text/x-wiki __TOC__ ==Request an OpenStack Account== Once you have [http://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access PRISM/GI VPN access], you can request an OpenStack account. You will need to send an email to cluster-admin@soe.ucsc.edu asking for access, and let us know which lab you are in, or who your PI is, so we can place you in the right OpenStack group. ==Create a SSH Public/Private Keypair== To log into an OpenStack VM instance, you will need a SSH public key. The key is "injected" into the instance upon creation, and only someone with that key (i.e. 
you) will be able to log in via SSH initially. If you already have a SSH public and private key that you use elsewhere, you can use that one, and can skip to the next step. If you don't have a SSH keypair set up yet, then you will need to log into the UNIX compatible machine you will be '''logging in from''' (a Mac/Apple computer will also work), and then run the 'ssh-keygen' command. If you are behind the VPN, you can first log into mustard, crimson or razzmatazz, which are linux servers. The command will look something like this: $ ssh-keygen -t rsa Generating public/private rsa key pair. Enter file in which to save the key (/public/home/frank/.ssh/id_rsa): Created directory '/public/home/frank/.ssh'. Enter passphrase (empty for no passphrase): [JUST HIT ENTER] Enter same passphrase again: [JUST HIT ENTER] Your identification has been saved in /public/home/frank/.ssh/id_rsa. Your public key has been saved in /public/home/frank/.ssh/id_rsa.pub. The key fingerprint is: SHA256:dhJG1A3gcwj7Mz17ommt3NIczMVVgrzp8Tf6F1X4jpI The key's randomart image is: +---[RSA 2048]----+ | ..+o.+ ..o.| | = .. + o..| | . * .. + ..| | o = * o| | So+o + o.| | . =+oE ooo| | +o.....o| | .o++o . .| | .=o. ...| +----[SHA256]-----+ You will then have a new directory, "~/.ssh", and inside that directory you will have a file called "id_rsa.pub". That is your SSH public key. You will need this in the next step in order to set up your key in OpenStack. ==Log In To giCloud== Once you have been notified that your account has been set up and have been given login credentials, connect to the VPN and then go to this link in your favorite web browser, which is the login page: https://gicloud.prism To login, enter your username and password. Also you will see a "Domain" field, just enter the word "default" for the domain. Click "Log In". You will be logged into your group's summary page. 
==Upload your SSH Public Key== After creating your new key in the above "Create a SSH Public/Private Keypair" step, you will need to upload that key into OpenStack. Once you are logged in, on the left hand navigation menu, click "Project", then in the submenu, select "Compute", and finally select "Key Pairs". It should take you to the "Key Pairs" window as shown here. [[File:Keypairs.png|900px]] Next click the "Import Public Key" button on the top right of the window. In the resulting window, name your key in the "Key Pair Name" field. Name it something descriptive like "laptop-key" if the key is on your laptop, or "mustard-key" if you are logged into mustard, etc. To get your key, open a terminal window and type "cat ~/.ssh/id_rsa.pub" to get your full key, as so: $ cat ~/.ssh/id_rsa.pub ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAyVKNfdBbDIk7Iq8JmL+u3vxAn4M1iaQgMU5tHJhMSAYBZEZRLZAFc+Qovxe5zzs1ixte9lCipLep39q2I4U8XND17nYliZ4HVM4MW4GsMUfKsgX2FI3mB2vAQ9pZSLkAhTg2D+92uALUSSv1cDZhTqo7DuPRX2Upxyd5QbRL6TRFswBjHz2vY/JpaPQm1S1d10mokPpmxehLfwp0mVgmz1Uv/6FflqiZ68DhDN67cs1yQgWYXQ01IHPjzTKRwCuZVkgT99rkoqy6TkAyrvsfzYPZbIA2y+ovOBzq6WCUT9gp5Jx/UE6CxLSmAuGPAJkV5D/twKIe75xc+5jdi3I1cgKw== user@laptop Copy that whole line, starting with "ssh-rsa" all the way through the very last character, including the "user@laptop" bit (which may be different for you, just be sure to include it in the line copy). Then back in the OpenStack Key Pair dialogue window, paste in the keypair in the "Public Key" window, then click "Import Key". The key should then appear in the key list. c568c3b4530f58abf4740bb4f5261f28a9daa898 199 198 2020-04-28T20:48:43Z Weiler 3 wikitext text/x-wiki __TOC__ ==Request an OpenStack Account== Once you have [http://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access PRISM/GI VPN access], you can request an OpenStack account. 
You will need to send an email to cluster-admin@soe.ucsc.edu asking for access, and let us know which lab you are in, or who your PI is, so we can place you in the right OpenStack group. ==Create a SSH Public/Private Keypair== To log into an OpenStack VM instance, you will need a SSH public key. The key is "injected" into the instance upon creation, and only someone with that key (i.e. you) will be able to log in via SSH initially. If you already have a SSH public and private key that you use elsewhere, you can use that one, and can skip to the next step. If you don't have a SSH keypair set up yet, then you will need to log into the UNIX compatible machine you will be '''logging in from''' (a Mac/Apple computer will also work), and then run the 'ssh-keygen' command. If you are behind the VPN, you can first log into mustard, crimson or razzmatazz, which are linux servers. The command will look something like this: $ ssh-keygen -t rsa Generating public/private rsa key pair. Enter file in which to save the key (/public/home/frank/.ssh/id_rsa): Created directory '/public/home/frank/.ssh'. Enter passphrase (empty for no passphrase): [JUST HIT ENTER] Enter same passphrase again: [JUST HIT ENTER] Your identification has been saved in /public/home/frank/.ssh/id_rsa. Your public key has been saved in /public/home/frank/.ssh/id_rsa.pub. The key fingerprint is: SHA256:dhJG1A3gcwj7Mz17ommt3NIczMVVgrzp8Tf6F1X4jpI The key's randomart image is: +---[RSA 2048]----+ | ..+o.+ ..o.| | = .. + o..| | . * .. + ..| | o = * o| | So+o + o.| | . =+oE ooo| | +o.....o| | .o++o . .| | .=o. ...| +----[SHA256]-----+ You will then have a new directory, "~/.ssh", and inside that directory you will have a file called "id_rsa.pub". That is your SSH public key. You will need this in the next step in order to set up your key in OpenStack. 
==Log In To giCloud== Once you have been notified that your account has been set up and have been given login credentials, connect to the VPN and then go to this link in your favorite web browser, which is the login page: https://gicloud.prism To login, enter your username and password. Also you will see a "Domain" field, just enter the word "default" for the domain. Click "Log In". You will be logged into your group's summary page. ==Upload your SSH Public Key== After creating your new key in the above "Create a SSH Public/Private Keypair" step, you will need to upload that key into OpenStack. Once you are logged in, on the left hand navigation menu, click "Project", then in the submenu, select "Compute", and finally select "Key Pairs". It should take you to the "Key Pairs" window as shown here. [[File:Keypairs.png|900px]] Next click the "Import Public Key" button on the top right of the window. In the resulting window, name your key in the "Key Pair Name" field. Name it something descriptive like "laptop-key" if the key is on your laptop, or "mustard-key" if you are logged into mustard, etc. To get your key, open a terminal window and type "cat ~/.ssh/id_rsa.pub" to get your full key, as so: $ cat ~/.ssh/id_rsa.pub ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAyVKNfdBbDIk7Iq8JmL+u3vxAn4M1iaQgMU5tHJhMSAYBZEZRLZAFc+Qovxe5zzs1ixte9lCipLep39q2I4U8XND17nYliZ4HVM4MW4GsMUfKsgX2FI3mB2vAQ9pZSLkAhTg2D+92uALUSSv1cDZhTqo7DuPRX2Upxyd5QbRL6TRFswBjHz2vY/JpaPQm1S1d10mokPpmxehLfwp0mVgmz1Uv/6FflqiZ68DhDN67cs1yQgWYXQ01IHPjzTKRwCuZVkgT99rkoqy6TkAyrvsfzYPZbIA2y+ovOBzq6WCUT9gp5Jx/UE6CxLSmAuGPAJkV5D/twKIe75xc+5jdi3I1cgKw== user@laptop Copy that whole line, starting with "ssh-rsa" all the way through the very last character, including the "user@laptop" bit (which may be different for you, just be sure to include it in the line copy). Then back in the OpenStack Key Pair dialogue window, paste in the keypair in the "Public Key" window, then click "Import Key". 
The key should then appear in the key list. ==Launch a New Instance== We are now ready to launch our new VM instance. On the left navigation menu, select b9c441682954e2f8ffc0636f04b31da2a4a66ae4 200 199 2020-04-28T21:01:37Z Weiler 3 wikitext text/x-wiki __TOC__ ==Request an OpenStack Account== Once you have [http://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access PRISM/GI VPN access], you can request an OpenStack account. You will need to send an email to cluster-admin@soe.ucsc.edu asking for access, and let us know which lab you are in, or who your PI is, so we can place you in the right OpenStack group. ==Create a SSH Public/Private Keypair== To log into an OpenStack VM instance, you will need a SSH public key. The key is "injected" into the instance upon creation, and only someone with that key (i.e. you) will be able to log in via SSH initially. If you already have a SSH public and private key that you use elsewhere, you can use that one, and can skip to the next step. If you don't have a SSH keypair set up yet, then you will need to log into the UNIX compatible machine you will be '''logging in from''' (a Mac/Apple computer will also work), and then run the 'ssh-keygen' command. If you are behind the VPN, you can first log into mustard, crimson or razzmatazz, which are linux servers. The command will look something like this: $ ssh-keygen -t rsa Generating public/private rsa key pair. Enter file in which to save the key (/public/home/frank/.ssh/id_rsa): Created directory '/public/home/frank/.ssh'. Enter passphrase (empty for no passphrase): [JUST HIT ENTER] Enter same passphrase again: [JUST HIT ENTER] Your identification has been saved in /public/home/frank/.ssh/id_rsa. Your public key has been saved in /public/home/frank/.ssh/id_rsa.pub. The key fingerprint is: SHA256:dhJG1A3gcwj7Mz17ommt3NIczMVVgrzp8Tf6F1X4jpI The key's randomart image is: +---[RSA 2048]----+ | ..+o.+ ..o.| | = .. + o..| | . * .. + ..| | o = * o| | So+o + o.| | . 
=+oE ooo| | +o.....o| | .o++o . .| | .=o. ...| +----[SHA256]-----+ You will then have a new directory, "~/.ssh", and inside that directory you will have a file called "id_rsa.pub". That is your SSH public key. You will need this in the next step in order to set up your key in OpenStack. ==Log In To giCloud== Once you have been notified that your account has been set up and have been given login credentials, connect to the VPN and then go to this link in your favorite web browser, which is the login page: https://gicloud.prism To login, enter your username and password. Also you will see a "Domain" field, just enter the word "default" for the domain. Click "Log In". You will be logged into your group's summary page. ==Upload your SSH Public Key== After creating your new key in the above "Create a SSH Public/Private Keypair" step, you will need to upload that key into OpenStack. Once you are logged in, on the left hand navigation menu, click "Project", then in the submenu, select "Compute", and finally select "Key Pairs". It should take you to the "Key Pairs" window as shown here. [[File:Keypairs.png|900px]] Next click the "Import Public Key" button on the top right of the window. In the resulting window, name your key in the "Key Pair Name" field. Name it something descriptive like "laptop-key" if the key is on your laptop, or "mustard-key" if you are logged into mustard, etc. 
To get your key, open a terminal window and type "cat ~/.ssh/id_rsa.pub" to get your full key, as so: $ cat ~/.ssh/id_rsa.pub ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAyVKNfdBbDIk7Iq8JmL+u3vxAn4M1iaQgMU5tHJhMSAYBZEZRLZAFc+Qovxe5zzs1ixte9lCipLep39q2I4U8XND17nYliZ4HVM4MW4GsMUfKsgX2FI3mB2vAQ9pZSLkAhTg2D+92uALUSSv1cDZhTqo7DuPRX2Upxyd5QbRL6TRFswBjHz2vY/JpaPQm1S1d10mokPpmxehLfwp0mVgmz1Uv/6FflqiZ68DhDN67cs1yQgWYXQ01IHPjzTKRwCuZVkgT99rkoqy6TkAyrvsfzYPZbIA2y+ovOBzq6WCUT9gp5Jx/UE6CxLSmAuGPAJkV5D/twKIe75xc+5jdi3I1cgKw== user@laptop Copy that whole line, starting with "ssh-rsa" all the way through the very last character, including the "user@laptop" bit (which may be different for you, just be sure to include it in the line copy). Then back in the OpenStack Key Pair dialogue window, paste in the keypair in the "Public Key" window, then click "Import Key". The key should then appear in the key list. ==Launch a New Instance== We are now ready to launch our new VM instance. On the left navigation menu, select "Project", then in the submenu, select "Compute", and finally select "Instances". You will see any currently running instances in your group, as seen in the image below: 4cda5ac4c1a2edc52d3eb63df4d23a2e0991c999 201 200 2020-04-28T21:14:53Z Weiler 3 wikitext text/x-wiki __TOC__ ==Request an OpenStack Account== Once you have [http://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access PRISM/GI VPN access], you can request an OpenStack account. You will need to send an email to cluster-admin@soe.ucsc.edu asking for access, and let us know which lab you are in, or who your PI is, so we can place you in the right OpenStack group. ==Create a SSH Public/Private Keypair== To log into an OpenStack VM instance, you will need a SSH public key. The key is "injected" into the instance upon creation, and only someone with that key (i.e. you) will be able to log in via SSH initially. 
If you already have a SSH public and private key that you use elsewhere, you can use that one, and can skip to the next step. If you don't have a SSH keypair set up yet, then you will need to log into the UNIX compatible machine you will be '''logging in from''' (a Mac/Apple computer will also work), and then run the 'ssh-keygen' command. If you are behind the VPN, you can first log into mustard, crimson or razzmatazz, which are linux servers. The command will look something like this: $ ssh-keygen -t rsa Generating public/private rsa key pair. Enter file in which to save the key (/public/home/frank/.ssh/id_rsa): Created directory '/public/home/frank/.ssh'. Enter passphrase (empty for no passphrase): [JUST HIT ENTER] Enter same passphrase again: [JUST HIT ENTER] Your identification has been saved in /public/home/frank/.ssh/id_rsa. Your public key has been saved in /public/home/frank/.ssh/id_rsa.pub. The key fingerprint is: SHA256:dhJG1A3gcwj7Mz17ommt3NIczMVVgrzp8Tf6F1X4jpI The key's randomart image is: +---[RSA 2048]----+ | ..+o.+ ..o.| | = .. + o..| | . * .. + ..| | o = * o| | So+o + o.| | . =+oE ooo| | +o.....o| | .o++o . .| | .=o. ...| +----[SHA256]-----+ You will then have a new directory, "~/.ssh", and inside that directory you will have a file called "id_rsa.pub". That is your SSH public key. You will need this in the next step in order to set up your key in OpenStack. ==Log In To giCloud== Once you have been notified that your account has been set up and have been given login credentials, connect to the VPN and then go to this link in your favorite web browser, which is the login page: https://gicloud.prism To login, enter your username and password. Also you will see a "Domain" field, just enter the word "default" for the domain. Click "Log In". You will be logged into your group's summary page. ==Upload your SSH Public Key== After creating your new key in the above "Create a SSH Public/Private Keypair" step, you will need to upload that key into OpenStack. 
Once you are logged in, on the left hand navigation menu, click "Project", then in the submenu, select "Compute", and finally select "Key Pairs". It should take you to the "Key Pairs" window as shown here. [[File:Keypairs.png|900px]] Next click the "Import Public Key" button on the top right of the window. In the resulting window, name your key in the "Key Pair Name" field. Name it something descriptive like "laptop-key" if the key is on your laptop, or "mustard-key" if you are logged into mustard, etc. To get your key, open a terminal window and type "cat ~/.ssh/id_rsa.pub" to get your full key, as so: $ cat ~/.ssh/id_rsa.pub ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAyVKNfdBbDIk7Iq8JmL+u3vxAn4M1iaQgMU5tHJhMSAYBZEZRLZAFc+Qovxe5zzs1ixte9lCipLep39q2I4U8XND17nYliZ4HVM4MW4GsMUfKsgX2FI3mB2vAQ9pZSLkAhTg2D+92uALUSSv1cDZhTqo7DuPRX2Upxyd5QbRL6TRFswBjHz2vY/JpaPQm1S1d10mokPpmxehLfwp0mVgmz1Uv/6FflqiZ68DhDN67cs1yQgWYXQ01IHPjzTKRwCuZVkgT99rkoqy6TkAyrvsfzYPZbIA2y+ovOBzq6WCUT9gp5Jx/UE6CxLSmAuGPAJkV5D/twKIe75xc+5jdi3I1cgKw== user@laptop Copy that whole line, starting with "ssh-rsa" all the way through the very last character, including the "user@laptop" bit (which may be different for you, just be sure to include it in the line copy). Then back in the OpenStack Key Pair dialogue window, paste in the keypair in the "Public Key" window, then click "Import Key". The key should then appear in the key list. ==Launch a New Instance== We are now ready to launch our new VM instance. On the left navigation menu, select "Project", then in the submenu, select "Compute", and finally select "Instances". You will see any currently running instances in your group in the resulting screen. Next you need to click the "Launch Instance" button on the top right. You will be put into the "Details" tab in the instance creation dialogue. You need to choose an instance name and enter it into the "Instance Name" field. It should include your username as a prefix so that others know who owns each instance. 
Something like "frank-newtest1" would work well. You can ignore the "Description" field, "Availability Zone" should be "nova: and "Count" should be "1". Next click the "Source" tab on the left. In the "Source" menu, in the "Select Boot Source" field, select "Image". Then in the below list of images, choose your image and click the little "Up Arrow" icon to the right of the image you want to add it. Next click the "Flavor" tab on the left. In that menu, choose how much CPU, RAM and disk space you want for your new VM. Some images have minimum requirements, and as such some of the smaller flavors may not be available. Select your flavor by clicking the little "Up Arrow" icon on the right of your flavor. Next click the "Key Pair" tab on the left. Click the little "Up Arrow" to the right of the Kep Pair you created in the previous step where you create a Key Pair. Ignore the rest of the options on the left, you have configured all you need to launch the instance. Click the blue "Launch Instance" button on the bottom right of your window. b2e608fe1c0e93dfc1886c35c26ceb33039ae6c8 File:Keypairs.png 6 27 190 2020-04-28T20:31:50Z Weiler 3 keypairs.png wikitext text/x-wiki keypairs.png acb448a9211058dbaa0565c92046ecfceeffd023 File:Launch.png 6 28 202 2020-04-28T21:16:45Z Weiler 3 wikitext text/x-wiki da39a3ee5e6b4b0d3255bfef95601890afd80709 Quick Start Instructions to Get Rolling with OpenStack 0 26 203 201 2020-04-28T21:17:58Z Weiler 3 wikitext text/x-wiki __TOC__ ==Request an OpenStack Account== Once you have [http://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access PRISM/GI VPN access], you can request an OpenStack account. You will need to send an email to cluster-admin@soe.ucsc.edu asking for access, and let us know which lab you are in, or who your PI is, so we can place you in the right OpenStack group. ==Create a SSH Public/Private Keypair== To log into an OpenStack VM instance, you will need a SSH public key. 
The key is "injected" into the instance upon creation, and only someone with that key (i.e. you) will be able to log in via SSH initially. If you already have a SSH public and private key that you use elsewhere, you can use that one, and can skip to the next step. If you don't have a SSH keypair set up yet, then you will need to log into the UNIX compatible machine you will be '''logging in from''' (a Mac/Apple computer will also work), and then run the 'ssh-keygen' command. If you are behind the VPN, you can first log into mustard, crimson or razzmatazz, which are linux servers. The command will look something like this: $ ssh-keygen -t rsa Generating public/private rsa key pair. Enter file in which to save the key (/public/home/frank/.ssh/id_rsa): Created directory '/public/home/frank/.ssh'. Enter passphrase (empty for no passphrase): [JUST HIT ENTER] Enter same passphrase again: [JUST HIT ENTER] Your identification has been saved in /public/home/frank/.ssh/id_rsa. Your public key has been saved in /public/home/frank/.ssh/id_rsa.pub. The key fingerprint is: SHA256:dhJG1A3gcwj7Mz17ommt3NIczMVVgrzp8Tf6F1X4jpI The key's randomart image is: +---[RSA 2048]----+ | ..+o.+ ..o.| | = .. + o..| | . * .. + ..| | o = * o| | So+o + o.| | . =+oE ooo| | +o.....o| | .o++o . .| | .=o. ...| +----[SHA256]-----+ You will then have a new directory, "~/.ssh", and inside that directory you will have a file called "id_rsa.pub". That is your SSH public key. You will need this in the next step in order to set up your key in OpenStack. ==Log In To giCloud== Once you have been notified that your account has been set up and have been given login credentials, connect to the VPN and then go to this link in your favorite web browser, which is the login page: https://gicloud.prism To login, enter your username and password. Also you will see a "Domain" field, just enter the word "default" for the domain. Click "Log In". You will be logged into your group's summary page. 
==Upload your SSH Public Key== After creating your new key in the above "Create a SSH Public/Private Keypair" step, you will need to upload that key into OpenStack. Once you are logged in, on the left hand navigation menu, click "Project", then in the submenu, select "Compute", and finally select "Key Pairs". It should take you to the "Key Pairs" window as shown here. [[File:Keypairs.png|900px]] Next click the "Import Public Key" button on the top right of the window. In the resulting window, name your key in the "Key Pair Name" field. Name it something descriptive like "laptop-key" if the key is on your laptop, or "mustard-key" if you are logged into mustard, etc. To get your key, open a terminal window and type "cat ~/.ssh/id_rsa.pub" to get your full key, as so: $ cat ~/.ssh/id_rsa.pub ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAyVKNfdBbDIk7Iq8JmL+u3vxAn4M1iaQgMU5tHJhMSAYBZEZRLZAFc+Qovxe5zzs1ixte9lCipLep39q2I4U8XND17nYliZ4HVM4MW4GsMUfKsgX2FI3mB2vAQ9pZSLkAhTg2D+92uALUSSv1cDZhTqo7DuPRX2Upxyd5QbRL6TRFswBjHz2vY/JpaPQm1S1d10mokPpmxehLfwp0mVgmz1Uv/6FflqiZ68DhDN67cs1yQgWYXQ01IHPjzTKRwCuZVkgT99rkoqy6TkAyrvsfzYPZbIA2y+ovOBzq6WCUT9gp5Jx/UE6CxLSmAuGPAJkV5D/twKIe75xc+5jdi3I1cgKw== user@laptop Copy that whole line, starting with "ssh-rsa" all the way through the very last character, including the "user@laptop" bit (which may be different for you, just be sure to include it in the line copy). Then back in the OpenStack Key Pair dialogue window, paste in the keypair in the "Public Key" window, then click "Import Key". The key should then appear in the key list. ==Launch a New Instance== We are now ready to launch our new VM instance. On the left navigation menu, select "Project", then in the submenu, select "Compute", and finally select "Instances". You will see any currently running instances in your group in the resulting screen. Next you need to click the "Launch Instance" button on the top right. You will be put into the "Details" tab in the instance creation dialogue. 
You need to choose an instance name and enter it into the "Instance Name" field. It should include your username as a prefix so that others know who owns each instance. Something like "frank-newtest1" would work well. You can ignore the "Description" field, "Availability Zone" should be "nova: and "Count" should be "1". Next click the "Source" tab on the left. In the "Source" menu, in the "Select Boot Source" field, select "Image". Then in the below list of images, choose your image and click the little "Up Arrow" icon to the right of the image you want to add it. Next click the "Flavor" tab on the left. In that menu, choose how much CPU, RAM and disk space you want for your new VM. Some images have minimum requirements, and as such some of the smaller flavors may not be available. Select your flavor by clicking the little "Up Arrow" icon on the right of your flavor. Next click the "Key Pair" tab on the left. Click the little "Up Arrow" to the right of the Kep Pair you created in the previous step where you create a Key Pair. Ignore the rest of the options on the left, you have configured all you need to launch the instance. Click the blue "Launch Instance" button on the bottom right of your window, as seen below: [[File:Launch.png|850px]] f3e78d34a31b7f20a435a08bfacbf5f0a62654ad 204 203 2020-04-28T22:02:15Z Weiler 3 wikitext text/x-wiki __TOC__ ==Request an OpenStack Account== Once you have [http://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access PRISM/GI VPN access], you can request an OpenStack account. You will need to send an email to cluster-admin@soe.ucsc.edu asking for access, and let us know which lab you are in, or who your PI is, so we can place you in the right OpenStack group. ==Create a SSH Public/Private Keypair== To log into an OpenStack VM instance, you will need a SSH public key. The key is "injected" into the instance upon creation, and only someone with that key (i.e. you) will be able to log in via SSH initially. 
If you already have an SSH public and private key that you use elsewhere, you can use that one and skip to the next step.

If you don't have an SSH keypair set up yet, log into the UNIX-compatible machine you will be '''logging in from''' (a Mac/Apple computer will also work) and run the 'ssh-keygen' command. If you are behind the VPN, you can first log into mustard, crimson or razzmatazz, which are Linux servers. The command will look something like this:

 $ ssh-keygen -t rsa
 Generating public/private rsa key pair.
 Enter file in which to save the key (/public/home/frank/.ssh/id_rsa):
 Created directory '/public/home/frank/.ssh'.
 Enter passphrase (empty for no passphrase): [JUST HIT ENTER]
 Enter same passphrase again: [JUST HIT ENTER]
 Your identification has been saved in /public/home/frank/.ssh/id_rsa.
 Your public key has been saved in /public/home/frank/.ssh/id_rsa.pub.
 The key fingerprint is:
 SHA256:dhJG1A3gcwj7Mz17ommt3NIczMVVgrzp8Tf6F1X4jpI
 The key's randomart image is:
 +---[RSA 2048]----+
 | ..+o.+ ..o.|
 | = .. + o..|
 | . * .. + ..|
 | o = * o|
 | So+o + o.|
 | . =+oE ooo|
 | +o.....o|
 | .o++o . .|
 | .=o. ...|
 +----[SHA256]-----+

You will then have a new directory, "~/.ssh", and inside it a file called "id_rsa.pub". That is your SSH public key; you will need it in the next step to set up your key in OpenStack.

==Log In To giCloud==
Once you have been notified that your account has been set up and you have been given login credentials, connect to the VPN and then open the login page in your favorite web browser:

 https://gicloud.prism

To log in, enter your username and password. You will also see a "Domain" field; enter the word "default" for the domain. Click "Log In" and you will be taken to your group's summary page.

==Upload your SSH Public Key==
After creating your new key in the "Create a SSH Public/Private Keypair" step above, you will need to upload that key into OpenStack.
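Before uploading, it can help to sanity-check that the line you are about to copy out of "id_rsa.pub" is one complete OpenSSH public key. The following Python sketch is illustrative only (it is not part of the OpenStack workflow); it verifies that the base64 blob decodes and that the key type embedded in the blob matches the first field of the line:

```python
import base64
import struct

def looks_like_openssh_pubkey(line: str) -> bool:
    """Rough completeness check for a single-line OpenSSH public key.

    The base64 blob begins with a length-prefixed copy of the key type
    (RFC 4253 string encoding), so the first whitespace-separated field
    and the start of the decoded blob must agree.
    """
    fields = line.strip().split()
    if len(fields) < 2:  # need at least "<type> <base64>"; the comment is optional
        return False
    key_type, blob_b64 = fields[0], fields[1]
    try:
        blob = base64.b64decode(blob_b64, validate=True)
    except ValueError:  # bad characters or truncated base64
        return False
    if len(blob) < 4:
        return False
    (type_len,) = struct.unpack(">I", blob[:4])
    return blob[4:4 + type_len] == key_type.encode()
```

A truncated paste (for example, one that lost the tail of the base64 blob) will typically fail the base64 decode and be rejected.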
Once you are logged in, on the left-hand navigation menu, click "Project", then in the submenu select "Compute", and finally select "Key Pairs". It should take you to the "Key Pairs" window as shown here.

[[File:Keypairs.png|900px]]

Next click the "Import Public Key" button at the top right of the window. In the resulting window, name your key in the "Key Pair Name" field. Name it something descriptive like "laptop-key" if the key is on your laptop, or "mustard-key" if you are logged into mustard, etc.

To get your key, open a terminal window and run "cat ~/.ssh/id_rsa.pub" to print the full key, like so:

 $ cat ~/.ssh/id_rsa.pub
 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAyVKNfdBbDIk7Iq8JmL+u3vxAn4M1iaQgMU5tHJhMSAYBZEZRLZAFc+Qovxe5zzs1ixte9lCipLep39q2I4U8XND17nYliZ4HVM4MW4GsMUfKsgX2FI3mB2vAQ9pZSLkAhTg2D+92uALUSSv1cDZhTqo7DuPRX2Upxyd5QbRL6TRFswBjHz2vY/JpaPQm1S1d10mokPpmxehLfwp0mVgmz1Uv/6FflqiZ68DhDN67cs1yQgWYXQ01IHPjzTKRwCuZVkgT99rkoqy6TkAyrvsfzYPZbIA2y+ovOBzq6WCUT9gp5Jx/UE6CxLSmAuGPAJkV5D/twKIe75xc+5jdi3I1cgKw== user@laptop

Copy that whole line, starting with "ssh-rsa" all the way through the very last character, including the "user@laptop" part (which may be different for you; just be sure to include it). Back in the OpenStack Key Pair dialogue, paste the key into the "Public Key" field, then click "Import Key". The key should then appear in the key list.

==Launch a New Instance==
We are now ready to launch a new VM instance. On the left navigation menu, select "Project", then in the submenu select "Compute", and finally select "Instances". The resulting screen shows any instances currently running in your group.

Next click the "Launch Instance" button at the top right. You will be placed in the "Details" tab of the instance creation dialogue. Choose an instance name and enter it into the "Instance Name" field. It should include your username as a prefix so that others know who owns each instance.
Something like "frank-newtest1" would work well. You can ignore the "Description" field, "Availability Zone" should be "nova: and "Count" should be "1". Next click the "Source" tab on the left. In the "Source" menu, in the "Select Boot Source" field, select "Image". Then in the below list of images, choose your image and click the little "Up Arrow" icon to the right of the image you want to add it. Next click the "Flavor" tab on the left. In that menu, choose how much CPU, RAM and disk space you want for your new VM. Some images have minimum requirements, and as such some of the smaller flavors may not be available. Select your flavor by clicking the little "Up Arrow" icon on the right of your flavor. Next click the "Key Pair" tab on the left. Click the little "Up Arrow" to the right of the Kep Pair you created in the previous step where you create a Key Pair. Ignore the rest of the options on the left, you have configured all you need to launch the instance. Click the blue "Launch Instance" button on the bottom right of your window, as seen below: [[File:Launch.png|850px]] You will be taken back to the Instances Summary page and you should see your new instance launching. After a bit your instance will change from the "Spawning" to "Running". This means the instance is now booting, and should finish booting in a minute or two. In the meantime we will need to attach a "Floating IP" address to your instance such that you can SSH into the instance. On the right side of your running instance, you should see a drop-down menu, usually the "Create Snapshot" option is pre-selected. Click the drop down menu arrow to open that menu, and select "Associate Floating IP". In the "Associate Floating IP" dialogue, click the drop down menu to see if any IP addresses are already available, and if so, go ahead and select one. If there are none available, click the little "+" button to the right to allocate a floating IP address. It will ask you what Pool to use, select "ext-net". 
You can put in a description if you want but most folks leave that field blank. Then click "Allocate IP". It will take you back one menu level. It will have a field "Port to be Associated", just leave that alone with the default that is already there. Click the blue "Associate" button on the bottom right of the window. You will be returned to the "Instances Summary" page again. You will see your instance running, and it should now list a "Floating IP" that it is running under. That is the IP that you will use to SSH to the instance. ==Connect to Your New Instance== Now that your instance is up and running, let's SSH to it and get going! '''From the computer you created your SSH keys on,''' SSH to your instance using the username as the OS type you chose (ubuntu, centos, etc), and the Floating IP address your instance has. '''You must be connected to the VPN for this to work!''' Example: $ ssh ubuntu@10.50.100.67 If you launched a CentOS instance, it would instead be "ssh centos@10.50.100.67", as appropriate. Assuming everything went as planned, you will be logged into your new Linux instance as the "ubuntu" or "centos" user, which is an unprivileged user. If you get a "Connection Refused" error when trying to SSH in, it means your instance isn't quite through launching yet, try again in about 30 seconds. You have full sudo rights however to do whatever administration you need to do. At this point it is assumed you have a little systems administration skills in your belt, or at least have some time to query Google as to how to perform various Linux tasks as necessary. Your instance has full Intenret access to the Greater Internet, so you can download thing fro the Internet, run "apt-get install" or "yum update" or whatever is appropriate. You can also then install any needed software you need to get your work done. '''NOTE:''' Your are the Systems Administrator of your instance - we cannot support questions on how to administer Linux for you. 
If OpenStack itself is having issues then please let us know, but please defer questions like "How do I install software on Ubuntu" to Google searches. ==Storage on Your New Instance== Most of your storage on your new instance will be located in the /mnt directory, as seen by a "df -h" command on your instance: ubuntu@erich1:~$ df -h Filesystem Size Used Avail Use% Mounted on udev 16G 0 16G 0% /dev tmpfs 3.2G 676K 3.2G 1% /run /dev/vda1 20G 975M 19G 5% / tmpfs 16G 0 16G 0% /dev/shm tmpfs 5.0M 0 5.0M 0% /run/lock tmpfs 16G 0 16G 0% /sys/fs/cgroup /dev/vda15 105M 3.4M 102M 4% /boot/efi /dev/vdb1 1.0T 1.1G 1023G 1% /mnt tmpfs 3.2G 0 3.2G 0% /run/user/1000 Notice that "/mnt" has 1TB of disk space, so store all your big important data in /mnt. Avoid storing data on "/" whenever possible to prevent issues with the root filesystem filling up. The exact amount of storage available will depend on what flavor you chose when creating the instance. ==Instance Control Options== Just a few notes on controlling your instances. They are fully functioning Linux machines, so a "sudo reboot" will reboot the machine, "sudo poweroff" will shut it down, etc. In cloud parlance, "Shut Down" means the instance is still there but the power is off. "Terminated" means it's fully deleted and is unrecoverable, so be sure you want to delete your instance before you do so. We do not back instances up. We also have no access to your instance so we cannot log in and see what's going on. You can control your instance in several ways from the OpenStack web interface, in the Instance Summary page. On the right side of your instance in the list will be that little drop down menu. Options of interest are: '''1: Create Snapshot''' Never use this option as we have not implemented snapshotting in this environment. '''2: View Log''' This will show you the boot/console log of the instance, so you can see if anything is causing issues. 
'''3: Hard Reboot Instance''' This will hard reboot your instance, kind of like hitting the power button to power the instance off, then it will power back on moments later. Useful if your instance is hosed because of a software crash or other things that may have crashed the instance. '''4: Delete Instance''' This will permanently destroy your instance. It will be deleted and is unrecoverable. It will also free up the resources it was using such that others can use them however. This is useful if the group quotas have been reached and some old instances need to be cleaned out to make room for new ones. '''5: Start Instance''' This option will be available if the instance is in the "Shut Down" state. It will boot up the instance when this option is invoked. Do not use the other options you may see there, most have not been implemented in our deployment of OpenStack. ==Etiquette== There is one main thing to remember when using instances in OpenStack. When you create an instance, it uses CPU, RAM and most importantly, it pins disk space for that instance. If you use up all the disk, CPU and RAM quota for your group, then others have no resources left to create their own instances. It is important to know that the best plan of action is to fire up your VM and keep it up when you need it, and then copy your data off it and delete the instance. Document steps takemn to create your instance such that you could do it again if you needed to. If the physical node that your instance resides on blows up, then your instance is lost forever and we have no backups, so it is up to you to back up important data. It's also not good form to spin up an instance and store data there, but not log in for months at a time. Then, you are pinning resources that other may need for urgent work. Try to be a good neighbor! 
__TOC__

==Request an OpenStack Account==

Once you have [http://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access PRISM/GI VPN access], you can request an OpenStack account. Send an email to cluster-admin@soe.ucsc.edu asking for access, and let us know which lab you are in, or who your PI is, so that we can place you in the right OpenStack group.

==Create a SSH Public/Private Keypair==

To log into an OpenStack VM instance, you will need a SSH public key. The key is "injected" into the instance upon creation, and initially only someone with the matching private key (i.e. you) will be able to log in via SSH.

If you already have a SSH keypair that you use elsewhere, you can use it and skip to the next step. If you don't have a SSH keypair set up yet, log into the UNIX-compatible machine you will be '''logging in from''' (a Mac/Apple computer will also work) and run the 'ssh-keygen' command. If you are behind the VPN, you can first log into mustard, crimson or razzmatazz, which are Linux servers. The session will look something like this:

 $ ssh-keygen -t rsa
 Generating public/private rsa key pair.
 Enter file in which to save the key (/public/home/frank/.ssh/id_rsa):
 Created directory '/public/home/frank/.ssh'.
 Enter passphrase (empty for no passphrase): [JUST HIT ENTER]
 Enter same passphrase again: [JUST HIT ENTER]
 Your identification has been saved in /public/home/frank/.ssh/id_rsa.
 Your public key has been saved in /public/home/frank/.ssh/id_rsa.pub.
 The key fingerprint is:
 SHA256:dhJG1A3gcwj7Mz17ommt3NIczMVVgrzp8Tf6F1X4jpI
 The key's randomart image is:
 +---[RSA 2048]----+
 | ..+o.+ ..o.|
 | = .. + o..|
 | . * .. + ..|
 | o = * o|
 | So+o + o.|
 | . =+oE ooo|
 | +o.....o|
 | .o++o . .|
 | .=o. ...|
 +----[SHA256]-----+

You will then have a new directory, "~/.ssh", and inside that directory a file called "id_rsa.pub". That is your SSH public key; you will need it in the next step in order to set up your key in OpenStack.

==Log In To giCloud==

Once you have been notified that your account has been set up and you have been given login credentials, connect to the VPN and then go to the login page in your favorite web browser:

https://gicloud.prism

To log in, enter your username and password. You will also see a "Domain" field; enter the word "default" there. Click "Log In", and you will be taken to your group's summary page.

==Upload your SSH Public Key==

After creating your new key in the above "Create a SSH Public/Private Keypair" step, you will need to upload that key into OpenStack. Once you are logged in, in the left-hand navigation menu click "Project", then in the submenu select "Compute", and finally select "Key Pairs". It should take you to the "Key Pairs" window as shown here.

[[File:Keypairs.png|900px]]

Next click the "Import Public Key" button on the top right of the window. In the resulting window, name your key in the "Key Pair Name" field. Name it something descriptive, like "laptop-key" if the key is on your laptop, or "mustard-key" if you are logged into mustard, etc.
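As a side note, the interactive 'ssh-keygen' session shown earlier can also be run non-interactively, which is handy when scripting key setup. A minimal sketch, where the output path "demo_key" is an illustrative example (the default ~/.ssh/id_rsa location is what the rest of this page assumes):

```shell
# Non-interactive variant of the ssh-keygen step shown above.
# -f sets the output file (illustrative path), -N "" sets an empty
# passphrase, and -q suppresses the interactive prompts and banner.
ssh-keygen -t rsa -b 2048 -f demo_key -N "" -q

# Print the public half -- this is the line you paste into OpenStack.
cat demo_key.pub
```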
To get your key, open a terminal window and run "cat ~/.ssh/id_rsa.pub" to print your full key, like so:

 $ cat ~/.ssh/id_rsa.pub
 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAyVKNfdBbDIk7Iq8JmL+u3vxAn4M1iaQgMU5tHJhMSAYBZEZRLZAFc+Qovxe5zzs1ixte9lCipLep39q2I4U8XND17nYliZ4HVM4MW4GsMUfKsgX2FI3mB2vAQ9pZSLkAhTg2D+92uALUSSv1cDZhTqo7DuPRX2Upxyd5QbRL6TRFswBjHz2vY/JpaPQm1S1d10mokPpmxehLfwp0mVgmz1Uv/6FflqiZ68DhDN67cs1yQgWYXQ01IHPjzTKRwCuZVkgT99rkoqy6TkAyrvsfzYPZbIA2y+ovOBzq6WCUT9gp5Jx/UE6CxLSmAuGPAJkV5D/twKIe75xc+5jdi3I1cgKw== user@laptop

Copy that whole line, starting with "ssh-rsa" all the way through the very last character, including the "user@laptop" part (which may be different for you; just be sure to include it). Then, back in the OpenStack Key Pair dialogue, paste the key into the "Public Key" window and click "Import Key". The key should then appear in the key list.

==Launch a New Instance==

We are now ready to launch a new VM instance. On the left navigation menu, select "Project", then in the submenu select "Compute", and finally select "Instances". The resulting screen shows any instances currently running in your group. Next, click the "Launch Instance" button on the top right.

You will be put into the "Details" tab of the instance creation dialogue. Choose an instance name and enter it into the "Instance Name" field. It should include your username as a prefix so that others know who owns each instance; something like "frank-newtest1" would work well. You can ignore the "Description" field, "Availability Zone" should be "nova" and "Count" should be "1".

Next click the "Source" tab on the left. In the "Select Boot Source" field, select "Image". Then, in the list of images below, find the image you want and click the little "Up Arrow" icon to its right to add it.

Next click the "Flavor" tab on the left.
In that menu, choose how much CPU, RAM and disk space you want for your new VM. Some images have minimum requirements, so some of the smaller flavors may not be available. Select your flavor by clicking the little "Up Arrow" icon on its right.

Next click the "Key Pair" tab on the left. Click the little "Up Arrow" to the right of the Key Pair you created in the previous "Create a SSH Public/Private Keypair" step. Ignore the rest of the options on the left; you have configured all you need to launch the instance. Click the blue "Launch Instance" button on the bottom right of your window, as seen below:

[[File:Launch.png|850px]]

You will be taken back to the Instances Summary page, where you should see your new instance launching. After a bit your instance will change from "Spawning" to "Running". This means the instance is now booting, and it should finish booting in a minute or two.

In the meantime, we will attach a "Floating IP" address to your instance so that you can SSH into it. On the right side of your running instance you should see a drop-down menu, usually with the "Create Snapshot" option pre-selected. Click the drop-down arrow to open that menu and select "Associate Floating IP". In the "Associate Floating IP" dialogue, click the drop-down menu to see if any IP addresses are already available, and if so, select one. If none are available, click the little "+" button to the right to allocate a floating IP address. It will ask you what Pool to use; select "ext-net". You can put in a description if you want, but most folks leave that field blank. Then click "Allocate IP". It will take you back one menu level, which has a field "Port to be Associated"; just leave the default that is already there. Click the blue "Associate" button on the bottom right of the window. You will be returned to the "Instances Summary" page.
You will see your instance running, and it should now list the "Floating IP" it is running under. That is the IP you will use to SSH to the instance.

==Connect to Your New Instance==

Now that your instance is up and running, let's SSH to it and get going! '''From the computer you created your SSH keys on,''' SSH to your instance using the OS type you chose as the username (ubuntu, centos, etc.) and the Floating IP address of your instance. '''You must be connected to the VPN for this to work!''' Example:

 $ ssh ubuntu@10.50.100.67

If you launched a CentOS instance, it would instead be "ssh centos@10.50.100.67", as appropriate. Assuming everything went as planned, you will be logged into your new Linux instance as the "ubuntu" or "centos" user, which is an unprivileged user. If you get a "Connection Refused" error when trying to SSH in, your instance isn't quite through launching yet; try again in about 30 seconds.

You do, however, have full sudo rights to do whatever administration you need. At this point it is assumed you have some systems administration skills under your belt, or at least some time to ask Google how to perform various Linux tasks as necessary. Your instance has full access to the greater Internet, so you can download things, run "apt-get install" or "yum update", or whatever is appropriate, and install any software you need to get your work done.

'''NOTE:''' You are the Systems Administrator of your instance - we cannot support questions on how to administer Linux for you. If OpenStack itself is having issues then please let us know, but please defer questions like "How do I install software on Ubuntu" to Google searches.
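If you will be connecting to the instance regularly, you can optionally add an entry to the ~/.ssh/config file on the machine holding your private key, so that a plain "ssh mynewvm" works. A sketch, where "mynewvm" and the address are illustrative examples:

```
Host mynewvm
    HostName 10.50.100.67
    User ubuntu
    IdentityFile ~/.ssh/id_rsa
```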
==Storage on Your New Instance==

Most of the storage on your new instance is located in the /mnt directory, as seen in the output of "df -h" on the instance:

 ubuntu@erich1:~$ df -h
 Filesystem      Size  Used Avail Use% Mounted on
 udev             16G     0   16G   0% /dev
 tmpfs           3.2G  676K  3.2G   1% /run
 /dev/vda1        20G  975M   19G   5% /
 tmpfs            16G     0   16G   0% /dev/shm
 tmpfs           5.0M     0  5.0M   0% /run/lock
 tmpfs            16G     0   16G   0% /sys/fs/cgroup
 /dev/vda15      105M  3.4M  102M   4% /boot/efi
 /dev/vdb1       1.0T  1.1G 1023G   1% /mnt
 tmpfs           3.2G     0  3.2G   0% /run/user/1000

Notice that "/mnt" has 1TB of disk space, so store all your big important data in /mnt. Avoid storing data on "/" whenever possible, to prevent the root filesystem from filling up. The exact amount of storage available depends on the flavor you chose when creating the instance.

==Instance Control Options==

A few notes on controlling your instances. They are fully functioning Linux machines, so "sudo reboot" will reboot the machine, "sudo poweroff" will shut it down, etc. In cloud parlance, "Shut Down" means the instance is still there but the power is off; "Terminated" means it is fully deleted and unrecoverable, so be sure you want to delete your instance before you do so. We do not back instances up. We also have no access to your instance, so we cannot log in and see what's going on.

You can control your instance in several ways from the OpenStack web interface, on the Instance Summary page. On the right side of your instance in the list is that little drop-down menu. Options of interest are:

'''1: Create Snapshot''' Never use this option, as we have not implemented snapshotting in this environment.

'''2: View Log''' This shows the boot/console log of the instance, so you can see if anything is causing issues.

'''3: Hard Reboot Instance''' This hard reboots your instance, kind of like hitting the power button to power the instance off; it will power back on moments later.
Useful if your instance is hosed because of a software crash or anything else that has crashed the instance.

'''4: Delete Instance''' This permanently destroys your instance. It is deleted and unrecoverable. It also frees up the resources it was using so that others can use them. This is useful when the group quotas have been reached and some old instances need to be cleaned out to make room for new ones.

'''5: Start Instance''' This option is available when the instance is in the "Shut Down" state; invoking it boots the instance back up.

Do not use the other options you may see there; most have not been implemented in our deployment of OpenStack.

==Changing Your OpenStack Web Interface Password==

Once you have logged in to the web interface, you can change your password as follows. On the top right of the OpenStack web interface you should see a little icon with your username on it. Click that icon to expand the drop-down menu and select "Settings". In the next window, on the left navigation bar, you should see the "Change Password" button. Complete the Change Password dialogue to change your password. You may have to log in again after changing your password.

==Etiquette==

There is one main thing to remember when using instances in OpenStack. When you create an instance, it uses CPU and RAM and, most importantly, it pins disk space for that instance. If you use up all of your group's disk, CPU and RAM quota, others have no resources left to create their own instances. The best plan of action is to fire up your VM, keep it up while you need it, then copy your data off it and delete the instance. Document the steps taken to create your instance so that you could do it again if you needed to. If the physical node that your instance resides on blows up, your instance is lost forever and we have no backups, so it is up to you to back up important data.
It's also not good form to spin up an instance and store data there, but not log in for months at a time. Then, you are pinning resources that other may need for urgent work. Try to be a good neighbor! 1cc0bffcc1a63866f27372b9abe03e58960881a1 207 206 2020-04-28T23:02:48Z Weiler 3 wikitext text/x-wiki __TOC__ ==Request an OpenStack Account== Once you have [http://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access PRISM/GI VPN access], you can request an OpenStack account. You will need to send an email to cluster-admin@soe.ucsc.edu asking for access, and let us know which lab you are in, or who your PI is, so we can place you in the right OpenStack group. ==Create a SSH Public/Private Keypair== To log into an OpenStack VM instance, you will need a SSH public key. The key is "injected" into the instance upon creation, and only someone with that key (i.e. you) will be able to log in via SSH initially. If you already have a SSH public and private key that you use elsewhere, you can use that one, and can skip to the next step. If you don't have a SSH keypair set up yet, then you will need to log into the UNIX compatible machine you will be '''logging in from''' (a Mac/Apple computer will also work), and then run the 'ssh-keygen' command. If you are behind the VPN, you can first log into mustard, crimson or razzmatazz, which are linux servers. The command will look something like this: $ ssh-keygen -t rsa Generating public/private rsa key pair. Enter file in which to save the key (/public/home/frank/.ssh/id_rsa): Created directory '/public/home/frank/.ssh'. Enter passphrase (empty for no passphrase): [JUST HIT ENTER] Enter same passphrase again: [JUST HIT ENTER] Your identification has been saved in /public/home/frank/.ssh/id_rsa. Your public key has been saved in /public/home/frank/.ssh/id_rsa.pub. The key fingerprint is: SHA256:dhJG1A3gcwj7Mz17ommt3NIczMVVgrzp8Tf6F1X4jpI The key's randomart image is: +---[RSA 2048]----+ | ..+o.+ ..o.| | = .. 
+ o..| | . * .. + ..| | o = * o| | So+o + o.| | . =+oE ooo| | +o.....o| | .o++o . .| | .=o. ...| +----[SHA256]-----+ You will then have a new directory, "~/.ssh", and inside that directory you will have a file called "id_rsa.pub". That is your SSH public key. You will need this in the next step in order to set up your key in OpenStack. ==Log In To giCloud== Once you have been notified that your account has been set up and have been given login credentials, connect to the VPN and then go to this link in your favorite web browser, which is the login page: https://gicloud.prism To login, enter your username and password. Also you will see a "Domain" field, just enter the word "default" for the domain. Click "Log In". You will be logged into your group's summary page. ==Upload your SSH Public Key== After creating your new key in the above "Create a SSH Public/Private Keypair" step, you will need to upload that key into OpenStack. Once you are logged in, on the left hand navigation menu, click "Project", then in the submenu, select "Compute", and finally select "Key Pairs". It should take you to the "Key Pairs" window as shown here. [[File:Keypairs.png|900px]] Next click the "Import Public Key" button on the top right of the window. In the resulting window, name your key in the "Key Pair Name" field. Name it something descriptive like "laptop-key" if the key is on your laptop, or "mustard-key" if you are logged into mustard, etc. 
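If you would rather script the key creation than answer ssh-keygen's prompts, it can be run non-interactively. A minimal sketch; the file path and comment below are demo placeholders (for a real login key you would use the default ~/.ssh/id_rsa path shown above):

```shell
# Generate an RSA keypair non-interactively: -N "" sets an empty passphrase,
# matching the [JUST HIT ENTER] steps in the interactive session above.
KEYFILE=/tmp/gi-demo-key            # placeholder; use ~/.ssh/id_rsa for real use
rm -f "$KEYFILE" "$KEYFILE.pub"     # start clean for the demo
ssh-keygen -t rsa -b 2048 -N "" -f "$KEYFILE" -C "user@laptop" >/dev/null
# This prints the single line you will paste into "Import Public Key":
cat "$KEYFILE.pub"
```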
To get your key, open a terminal window and type "cat ~/.ssh/id_rsa.pub" to print your full key, like so:

 $ cat ~/.ssh/id_rsa.pub
 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAyVKNfdBbDIk7Iq8JmL+u3vxAn4M1iaQgMU5tHJhMSAYBZEZRLZAFc+Qovxe5zzs1ixte9lCipLep39q2I4U8XND17nYliZ4HVM4MW4GsMUfKsgX2FI3mB2vAQ9pZSLkAhTg2D+92uALUSSv1cDZhTqo7DuPRX2Upxyd5QbRL6TRFswBjHz2vY/JpaPQm1S1d10mokPpmxehLfwp0mVgmz1Uv/6FflqiZ68DhDN67cs1yQgWYXQ01IHPjzTKRwCuZVkgT99rkoqy6TkAyrvsfzYPZbIA2y+ovOBzq6WCUT9gp5Jx/UE6CxLSmAuGPAJkV5D/twKIe75xc+5jdi3I1cgKw== user@laptop

Copy that whole line, starting with "ssh-rsa" all the way through the very last character, including the "user@laptop" bit (which may be different for you; just be sure to include it when you copy the line). Then, back in the OpenStack Key Pair dialogue window, paste the key into the "Public Key" field and click "Import Key". The key should then appear in the key list.

==Launch a New Instance==
We are now ready to launch our new VM instance. On the left navigation menu, select "Project", then in the submenu select "Compute", and finally select "Instances". You will see any currently running instances in your group in the resulting screen. Next, click the "Launch Instance" button on the top right. You will be put into the "Details" tab of the instance creation dialogue. Choose an instance name and enter it into the "Instance Name" field. It should include your username as a prefix so that others know who owns each instance. Something like "frank-newtest1" would work well. You can ignore the "Description" field; "Availability Zone" should be "nova" and "Count" should be "1".

Next click the "Source" tab on the left. In the "Source" menu, in the "Select Boot Source" field, select "Image", and next to it select "No" for "Create New Volume". Then in the list of images below, choose your image and click the little "Up Arrow" icon to the right of the image to add it. Next click the "Flavor" tab on the left.
In that menu, choose how much CPU, RAM and disk space you want for your new VM. Some images have minimum requirements, so some of the smaller flavors may not be available. Select your flavor by clicking the little "Up Arrow" icon on the right of your flavor. Next click the "Key Pair" tab on the left. Click the little "Up Arrow" to the right of the Key Pair you created in the earlier step. Ignore the rest of the options on the left; you have configured all you need to launch the instance. Click the blue "Launch Instance" button on the bottom right of your window, as seen below:

[[File:Launch.png|850px]]

You will be taken back to the Instances Summary page and you should see your new instance launching. After a bit your instance will change from "Spawning" to "Running". This means the instance is now booting, and it should finish booting in a minute or two. In the meantime we need to attach a "Floating IP" address to your instance so that you can SSH into it. On the right side of your running instance, you should see a drop-down menu, usually with the "Create Snapshot" option pre-selected. Click the drop-down menu arrow to open the menu and select "Associate Floating IP". In the "Associate Floating IP" dialogue, click the drop-down menu to see if any IP addresses are already available, and if so, select one. If there are none available, click the little "+" button to the right to allocate a floating IP address. It will ask you what Pool to use; select "ext-net". You can put in a description if you want, but most folks leave that field blank. Then click "Allocate IP". It will take you back one menu level, where there is a field "Port to be Associated"; leave that at the default that is already there. Click the blue "Associate" button on the bottom right of the window. You will be returned to the "Instances Summary" page.
You will see your instance running, and it should now list the "Floating IP" that it is running under. That is the IP you will use to SSH to the instance.

==Connect to Your New Instance==
Now that your instance is up and running, let's SSH to it and get going! '''From the computer you created your SSH keys on,''' SSH to your instance using the username matching the OS type you chose (ubuntu, centos, etc.) and the Floating IP address your instance has. '''You must be connected to the VPN for this to work!''' Example:

 $ ssh ubuntu@10.50.100.67

If you launched a CentOS instance, it would instead be "ssh centos@10.50.100.67", as appropriate. Assuming everything went as planned, you will be logged into your new Linux instance as the "ubuntu" or "centos" user, which is an unprivileged user; you do have full sudo rights, however, to do whatever administration you need to do. If you get a "Connection Refused" error when trying to SSH in, it means your instance isn't quite through launching yet; try again in about 30 seconds. At this point it is assumed you have a few systems administration skills under your belt, or at least have some time to query Google on how to perform various Linux tasks as necessary. Your instance has full access to the Greater Internet, so you can download things from the Internet, run "apt-get install" or "yum update" or whatever is appropriate, and install any software you need to get your work done.

'''NOTE:''' You are the Systems Administrator of your instance - we cannot support questions on how to administer Linux for you. If OpenStack itself is having issues then please let us know, but please defer questions like "How do I install software on Ubuntu" to Google searches.
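For reference, the web-console workflow above (boot from an image, allocate and attach a floating IP, then SSH in) maps onto the OpenStack command-line client. The web interface is the supported path here, so treat this as an untested sketch: the image, flavor, key and IP values are placeholders for whatever you chose, and the script only prints the commands (a dry run) rather than executing them.

```shell
# Dry-run sketch of roughly equivalent OpenStack CLI calls.
# All names below are placeholders; remove the leading "echo" (and set up
# CLI credentials) to actually run them against the cloud.
NAME=frank-newtest1
IMAGE=ubuntu-18.04      # whichever image you picked in the "Source" tab
FLAVOR=m1.medium        # whichever flavor you picked in the "Flavor" tab
KEY=laptop-key          # the key pair you imported earlier
FIP=10.50.100.67        # a floating IP from the "ext-net" pool

echo openstack server create --image "$IMAGE" --flavor "$FLAVOR" --key-name "$KEY" "$NAME"
echo openstack floating ip create ext-net
echo openstack server add floating ip "$NAME" "$FIP"
echo ssh "ubuntu@$FIP"
```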
==Storage on Your New Instance==
Most of the storage on your new instance is located in the /mnt directory, as seen with a "df -h" command on your instance:

 ubuntu@erich1:~$ df -h
 Filesystem      Size  Used Avail Use% Mounted on
 udev             16G     0   16G   0% /dev
 tmpfs           3.2G  676K  3.2G   1% /run
 /dev/vda1        20G  975M   19G   5% /
 tmpfs            16G     0   16G   0% /dev/shm
 tmpfs           5.0M     0  5.0M   0% /run/lock
 tmpfs            16G     0   16G   0% /sys/fs/cgroup
 /dev/vda15      105M  3.4M  102M   4% /boot/efi
 /dev/vdb1       1.0T  1.1G 1023G   1% /mnt
 tmpfs           3.2G     0  3.2G   0% /run/user/1000

Notice that "/mnt" has 1TB of disk space, so store all your big important data in /mnt. Avoid storing data on "/" whenever possible to prevent issues with the root filesystem filling up. The exact amount of storage available will depend on what flavor you chose when creating the instance.

==Instance Control Options==
Just a few notes on controlling your instances. They are fully functioning Linux machines, so a "sudo reboot" will reboot the machine, "sudo poweroff" will shut it down, etc. In cloud parlance, "Shut Down" means the instance is still there but the power is off; "Terminated" means it is fully deleted and unrecoverable, so be sure you really want to delete your instance before you do so. We do not back instances up. We also have no access to your instance, so we cannot log in and see what's going on.

You can control your instance in several ways from the OpenStack web interface, on the Instance Summary page. On the right side of your instance in the list is a little drop-down menu. Options of interest are:

'''1: Create Snapshot''' Never use this option, as we have not implemented snapshotting in this environment.

'''2: View Log''' This shows you the boot/console log of the instance, so you can see if anything is causing issues.

'''3: Hard Reboot Instance''' This hard reboots your instance, much like hitting the power button to power the instance off; it will power back on moments later. Useful if your instance is hosed because of a software crash or something else that has hung the instance.

'''4: Delete Instance''' This permanently destroys your instance. It will be deleted and is unrecoverable. It also frees up the resources it was using so that others can use them. This is useful if the group quotas have been reached and some old instances need to be cleaned out to make room for new ones.

'''5: Start Instance''' This option is available when the instance is in the "Shut Down" state; invoking it boots the instance back up.

Do not use the other options you may see there; most have not been implemented in our deployment of OpenStack.

==Changing Your OpenStack Web Interface Password==
Once you have logged in to the Web Interface, you can change your password by doing the following. On the top right of the OpenStack web interface, you should see a little icon with your username on it. Click that icon to expand the drop-down menu and select "Settings". In the next window, on the left navigation bar, you should see the "Change Password" button. Complete the Change Password dialogue to change your password. You may have to log in again after changing your password.

==Networking==
Your instances are connected at 10Gb/s to each other and to the internet. Of course, actual transfer speeds will likely vary based on disk speed, the speed of the location you are transferring data to or from, and other factors. Your instance is located in a private network that can only be seen by other instances in your group. Other OpenStack groups are logically separated into their own networks, and your instance cannot route to them. Also, no one can access your instance unless they have a VPN account with us, so your instances are completely fenced off from the Greater Internet inbound, which means you are largely secure against script kiddies and hackers. You are able to connect outbound from your instances.
==Etiquette== There is one main thing to remember when using instances in OpenStack. When you create an instance, it uses CPU, RAM and most importantly, it pins disk space for that instance. If you use up all the disk, CPU and RAM quota for your group, then others have no resources left to create their own instances. It is important to know that the best plan of action is to fire up your VM and keep it up when you need it, and then copy your data off it and delete the instance. Document steps taken to create your instance such that you could do it again if you needed to. If the physical node that your instance resides on blows up, then your instance is lost forever and we have no backups, so it is up to you to back up important data. It's also not good form to spin up an instance and store data there, but not log in for months at a time. Then, you are pinning resources that other may need for urgent work. Try to be a good neighbor! 83fa723d9cafa430c63ae598c764df1f05cca083 210 207 2020-04-29T20:16:10Z Weiler 3 /* Log In To giCloud */ wikitext text/x-wiki __TOC__ ==Request an OpenStack Account== Once you have [http://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access PRISM/GI VPN access], you can request an OpenStack account. You will need to send an email to cluster-admin@soe.ucsc.edu asking for access, and let us know which lab you are in, or who your PI is, so we can place you in the right OpenStack group. ==Create a SSH Public/Private Keypair== To log into an OpenStack VM instance, you will need a SSH public key. The key is "injected" into the instance upon creation, and only someone with that key (i.e. you) will be able to log in via SSH initially. If you already have a SSH public and private key that you use elsewhere, you can use that one, and can skip to the next step. 
If you don't have a SSH keypair set up yet, then you will need to log into the UNIX compatible machine you will be '''logging in from''' (a Mac/Apple computer will also work), and then run the 'ssh-keygen' command. If you are behind the VPN, you can first log into mustard, crimson or razzmatazz, which are linux servers. The command will look something like this: $ ssh-keygen -t rsa Generating public/private rsa key pair. Enter file in which to save the key (/public/home/frank/.ssh/id_rsa): Created directory '/public/home/frank/.ssh'. Enter passphrase (empty for no passphrase): [JUST HIT ENTER] Enter same passphrase again: [JUST HIT ENTER] Your identification has been saved in /public/home/frank/.ssh/id_rsa. Your public key has been saved in /public/home/frank/.ssh/id_rsa.pub. The key fingerprint is: SHA256:dhJG1A3gcwj7Mz17ommt3NIczMVVgrzp8Tf6F1X4jpI The key's randomart image is: +---[RSA 2048]----+ | ..+o.+ ..o.| | = .. + o..| | . * .. + ..| | o = * o| | So+o + o.| | . =+oE ooo| | +o.....o| | .o++o . .| | .=o. ...| +----[SHA256]-----+ You will then have a new directory, "~/.ssh", and inside that directory you will have a file called "id_rsa.pub". That is your SSH public key. You will need this in the next step in order to set up your key in OpenStack. ==Log In To giCloud== Once you have been notified that your account has been set up and have been given login credentials, connect to the VPN and then go to this link in your favorite web browser, which is the login page: http://gicloud.prism To login, enter your username and password. Also you will see a "Domain" field, just enter the word "default" for the domain. Click "Log In". You will be logged into your group's summary page. ==Upload your SSH Public Key== After creating your new key in the above "Create a SSH Public/Private Keypair" step, you will need to upload that key into OpenStack. 
Once you are logged in, on the left hand navigation menu, click "Project", then in the submenu, select "Compute", and finally select "Key Pairs". It should take you to the "Key Pairs" window as shown here. [[File:Keypairs.png|900px]] Next click the "Import Public Key" button on the top right of the window. In the resulting window, name your key in the "Key Pair Name" field. Name it something descriptive like "laptop-key" if the key is on your laptop, or "mustard-key" if you are logged into mustard, etc. To get your key, open a terminal window and type "cat ~/.ssh/id_rsa.pub" to get your full key, as so: $ cat ~/.ssh/id_rsa.pub ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAyVKNfdBbDIk7Iq8JmL+u3vxAn4M1iaQgMU5tHJhMSAYBZEZRLZAFc+Qovxe5zzs1ixte9lCipLep39q2I4U8XND17nYliZ4HVM4MW4GsMUfKsgX2FI3mB2vAQ9pZSLkAhTg2D+92uALUSSv1cDZhTqo7DuPRX2Upxyd5QbRL6TRFswBjHz2vY/JpaPQm1S1d10mokPpmxehLfwp0mVgmz1Uv/6FflqiZ68DhDN67cs1yQgWYXQ01IHPjzTKRwCuZVkgT99rkoqy6TkAyrvsfzYPZbIA2y+ovOBzq6WCUT9gp5Jx/UE6CxLSmAuGPAJkV5D/twKIe75xc+5jdi3I1cgKw== user@laptop Copy that whole line, starting with "ssh-rsa" all the way through the very last character, including the "user@laptop" bit (which may be different for you, just be sure to include it in the line copy). Then back in the OpenStack Key Pair dialogue window, paste in the keypair in the "Public Key" window, then click "Import Key". The key should then appear in the key list. ==Launch a New Instance== We are now ready to launch our new VM instance. On the left navigation menu, select "Project", then in the submenu, select "Compute", and finally select "Instances". You will see any currently running instances in your group in the resulting screen. Next you need to click the "Launch Instance" button on the top right. You will be put into the "Details" tab in the instance creation dialogue. You need to choose an instance name and enter it into the "Instance Name" field. It should include your username as a prefix so that others know who owns each instance. 
Something like "frank-newtest1" would work well. You can ignore the "Description" field, "Availability Zone" should be "nova: and "Count" should be "1". Next click the "Source" tab on the left. In the "Source" menu, in the "Select Boot Source" field, select "Image". Then in the below list of images, choose your image and click the little "Up Arrow" icon to the right of the image you want to add it. Next click the "Flavor" tab on the left. In that menu, choose how much CPU, RAM and disk space you want for your new VM. Some images have minimum requirements, and as such some of the smaller flavors may not be available. Select your flavor by clicking the little "Up Arrow" icon on the right of your flavor. Next click the "Key Pair" tab on the left. Click the little "Up Arrow" to the right of the Kep Pair you created in the previous step where you create a Key Pair. Ignore the rest of the options on the left, you have configured all you need to launch the instance. Click the blue "Launch Instance" button on the bottom right of your window, as seen below: [[File:Launch.png|850px]] You will be taken back to the Instances Summary page and you should see your new instance launching. After a bit your instance will change from the "Spawning" to "Running". This means the instance is now booting, and should finish booting in a minute or two. In the meantime we will need to attach a "Floating IP" address to your instance such that you can SSH into the instance. On the right side of your running instance, you should see a drop-down menu, usually the "Create Snapshot" option is pre-selected. Click the drop down menu arrow to open that menu, and select "Associate Floating IP". In the "Associate Floating IP" dialogue, click the drop down menu to see if any IP addresses are already available, and if so, go ahead and select one. If there are none available, click the little "+" button to the right to allocate a floating IP address. It will ask you what Pool to use, select "ext-net". 
You can put in a description if you want but most folks leave that field blank. Then click "Allocate IP". It will take you back one menu level. It will have a field "Port to be Associated", just leave that alone with the default that is already there. Click the blue "Associate" button on the bottom right of the window. You will be returned to the "Instances Summary" page again. You will see your instance running, and it should now list a "Floating IP" that it is running under. That is the IP that you will use to SSH to the instance. ==Connect to Your New Instance== Now that your instance is up and running, let's SSH to it and get going! '''From the computer you created your SSH keys on,''' SSH to your instance using the username as the OS type you chose (ubuntu, centos, etc), and the Floating IP address your instance has. '''You must be connected to the VPN for this to work!''' Example: $ ssh ubuntu@10.50.100.67 If you launched a CentOS instance, it would instead be "ssh centos@10.50.100.67", as appropriate. Assuming everything went as planned, you will be logged into your new Linux instance as the "ubuntu" or "centos" user, which is an unprivileged user. If you get a "Connection Refused" error when trying to SSH in, it means your instance isn't quite through launching yet, try again in about 30 seconds. You have full sudo rights however to do whatever administration you need to do. At this point it is assumed you have a little systems administration skills in your belt, or at least have some time to query Google as to how to perform various Linux tasks as necessary. Your instance has full Internet access to the Greater Internet, so you can download thing fro the Internet, run "apt-get install" or "yum update" or whatever is appropriate. You can also then install any needed software you need to get your work done. '''NOTE:''' Your are the Systems Administrator of your instance - we cannot support questions on how to administer Linux for you. 
If OpenStack itself is having issues then please let us know, but please defer questions like "How do I install software on Ubuntu" to Google searches. ==Storage on Your New Instance== Most of your storage on your new instance will be located in the /mnt directory, as seen by a "df -h" command on your instance: ubuntu@erich1:~$ df -h Filesystem Size Used Avail Use% Mounted on udev 16G 0 16G 0% /dev tmpfs 3.2G 676K 3.2G 1% /run /dev/vda1 20G 975M 19G 5% / tmpfs 16G 0 16G 0% /dev/shm tmpfs 5.0M 0 5.0M 0% /run/lock tmpfs 16G 0 16G 0% /sys/fs/cgroup /dev/vda15 105M 3.4M 102M 4% /boot/efi /dev/vdb1 1.0T 1.1G 1023G 1% /mnt tmpfs 3.2G 0 3.2G 0% /run/user/1000 Notice that "/mnt" has 1TB of disk space, so store all your big important data in /mnt. Avoid storing data on "/" whenever possible to prevent issues with the root filesystem filling up. The exact amount of storage available will depend on what flavor you chose when creating the instance. ==Instance Control Options== Just a few notes on controlling your instances. They are fully functioning Linux machines, so a "sudo reboot" will reboot the machine, "sudo poweroff" will shut it down, etc. In cloud parlance, "Shut Down" means the instance is still there but the power is off. "Terminated" means it's fully deleted and is unrecoverable, so be sure you want to delete your instance before you do so. We do not back instances up. We also have no access to your instance so we cannot log in and see what's going on. You can control your instance in several ways from the OpenStack web interface, in the Instance Summary page. On the right side of your instance in the list will be that little drop down menu. Options of interest are: '''1: Create Snapshot''' Never use this option as we have not implemented snapshotting in this environment. '''2: View Log''' This will show you the boot/console log of the instance, so you can see if anything is causing issues. 
'''3: Hard Reboot Instance''' This will hard reboot your instance, kind of like hitting the power button to power the instance off, then it will power back on moments later. Useful if your instance is hosed because of a software crash or other things that may have crashed the instance. '''4: Delete Instance''' This will permanently destroy your instance. It will be deleted and is unrecoverable. It will also free up the resources it was using such that others can use them however. This is useful if the group quotas have been reached and some old instances need to be cleaned out to make room for new ones. '''5: Start Instance''' This option will be available if the instance is in the "Shut Down" state. It will boot up the instance when this option is invoked. Do not use the other options you may see there, most have not been implemented in our deployment of OpenStack. ==Changing Your OpenStack Web Interface Password== Once you have logged in to the Web Interface, you can change your password by doing the following. On the top right of the OpenStack web interface, you should see a little icon with your username on it. Click that icon to expand the drop down menu there, and select "Settings". Then in the next window, on the left navigation bar, you should see the "Change Password" button. Complete the Change Password dialogue to change your password. You may have to log in again after changing your password. ==Networking== Your instances are connected at 10Gb/s between each other and the internet. Of course, actual transfer speeds will likely vary based on disk speed, speed of the location to are transferring data to or from, and other factors. Your instance will be located in a private network that can only be seen by other instances in your group. Other OpenStack groups are logically separated into their own networks and your instance cannot route to them. 
Also, no one can access your instance unless they have a VPN account with us, so your instances are completely fenced off from the Greater Internet inbound, which means you are largely secure against script kiddies and hackers. You are able to connect outbound from your instances. ==Etiquette== There is one main thing to remember when using instances in OpenStack. When you create an instance, it uses CPU, RAM and most importantly, it pins disk space for that instance. If you use up all the disk, CPU and RAM quota for your group, then others have no resources left to create their own instances. It is important to know that the best plan of action is to fire up your VM and keep it up when you need it, and then copy your data off it and delete the instance. Document steps taken to create your instance such that you could do it again if you needed to. If the physical node that your instance resides on blows up, then your instance is lost forever and we have no backups, so it is up to you to back up important data. It's also not good form to spin up an instance and store data there, but not log in for months at a time. Then, you are pinning resources that other may need for urgent work. Try to be a good neighbor! 9af0682c3a53f22ad51048917b3938299d63dfa0 215 210 2020-07-16T23:23:49Z Jgarcia 2 /* Launch a New Instance */ wikitext text/x-wiki __TOC__ ==Request an OpenStack Account== Once you have [http://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access PRISM/GI VPN access], you can request an OpenStack account. You will need to send an email to cluster-admin@soe.ucsc.edu asking for access, and let us know which lab you are in, or who your PI is, so we can place you in the right OpenStack group. ==Create a SSH Public/Private Keypair== To log into an OpenStack VM instance, you will need a SSH public key. The key is "injected" into the instance upon creation, and only someone with that key (i.e. you) will be able to log in via SSH initially. 
If you already have a SSH public and private key that you use elsewhere, you can use that one, and can skip to the next step. If you don't have a SSH keypair set up yet, then you will need to log into the UNIX compatible machine you will be '''logging in from''' (a Mac/Apple computer will also work), and then run the 'ssh-keygen' command. If you are behind the VPN, you can first log into mustard, crimson or razzmatazz, which are linux servers. The command will look something like this: $ ssh-keygen -t rsa Generating public/private rsa key pair. Enter file in which to save the key (/public/home/frank/.ssh/id_rsa): Created directory '/public/home/frank/.ssh'. Enter passphrase (empty for no passphrase): [JUST HIT ENTER] Enter same passphrase again: [JUST HIT ENTER] Your identification has been saved in /public/home/frank/.ssh/id_rsa. Your public key has been saved in /public/home/frank/.ssh/id_rsa.pub. The key fingerprint is: SHA256:dhJG1A3gcwj7Mz17ommt3NIczMVVgrzp8Tf6F1X4jpI The key's randomart image is: +---[RSA 2048]----+ | ..+o.+ ..o.| | = .. + o..| | . * .. + ..| | o = * o| | So+o + o.| | . =+oE ooo| | +o.....o| | .o++o . .| | .=o. ...| +----[SHA256]-----+ You will then have a new directory, "~/.ssh", and inside that directory you will have a file called "id_rsa.pub". That is your SSH public key. You will need this in the next step in order to set up your key in OpenStack. ==Log In To giCloud== Once you have been notified that your account has been set up and have been given login credentials, connect to the VPN and then go to this link in your favorite web browser, which is the login page: http://gicloud.prism To login, enter your username and password. Also you will see a "Domain" field, just enter the word "default" for the domain. Click "Log In". You will be logged into your group's summary page. ==Upload your SSH Public Key== After creating your new key in the above "Create a SSH Public/Private Keypair" step, you will need to upload that key into OpenStack. 
Once you are logged in, on the left hand navigation menu, click "Project", then in the submenu, select "Compute", and finally select "Key Pairs". It should take you to the "Key Pairs" window as shown here. [[File:Keypairs.png|900px]] Next click the "Import Public Key" button on the top right of the window. In the resulting window, name your key in the "Key Pair Name" field. Name it something descriptive like "laptop-key" if the key is on your laptop, or "mustard-key" if you are logged into mustard, etc. To get your key, open a terminal window and type "cat ~/.ssh/id_rsa.pub" to get your full key, as so: $ cat ~/.ssh/id_rsa.pub ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAyVKNfdBbDIk7Iq8JmL+u3vxAn4M1iaQgMU5tHJhMSAYBZEZRLZAFc+Qovxe5zzs1ixte9lCipLep39q2I4U8XND17nYliZ4HVM4MW4GsMUfKsgX2FI3mB2vAQ9pZSLkAhTg2D+92uALUSSv1cDZhTqo7DuPRX2Upxyd5QbRL6TRFswBjHz2vY/JpaPQm1S1d10mokPpmxehLfwp0mVgmz1Uv/6FflqiZ68DhDN67cs1yQgWYXQ01IHPjzTKRwCuZVkgT99rkoqy6TkAyrvsfzYPZbIA2y+ovOBzq6WCUT9gp5Jx/UE6CxLSmAuGPAJkV5D/twKIe75xc+5jdi3I1cgKw== user@laptop Copy that whole line, starting with "ssh-rsa" all the way through the very last character, including the "user@laptop" bit (which may be different for you, just be sure to include it in the line copy). Then back in the OpenStack Key Pair dialogue window, paste in the keypair in the "Public Key" window, then click "Import Key". The key should then appear in the key list. ==Launch a New Instance== We are now ready to launch our new VM instance. On the left navigation menu, select "Project", then in the submenu, select "Compute", and finally select "Instances". You will see any currently running instances in your group in the resulting screen. Next you need to click the "Launch Instance" button on the top right. You will be put into the "Details" tab in the instance creation dialogue. You need to choose an instance name and enter it into the "Instance Name" field. It should include your username as a prefix so that others know who owns each instance. 
Something like "frank-newtest1" would work well. You can ignore the "Description" field; "Availability Zone" should be "nova" and "Count" should be "1". Next click the "Source" tab on the left. In the "Source" menu, in the "Select Boot Source" field, select "Image", and next to it select "No" for "Create New Volume". Then, in the list of images below, find the image you want and click the little "Up Arrow" icon to its right to add it. Next click the "Flavor" tab on the left. In that menu, choose how much CPU, RAM, and disk space you want for your new VM. Some images have minimum requirements, so some of the smaller flavors may not be available. Select your flavor by clicking the little "Up Arrow" icon on its right. Next click the "Key Pair" tab on the left. Click the little "Up Arrow" to the right of the Key Pair you created in the previous step. Ignore the rest of the options on the left; you have configured all you need to launch the instance. Click the blue "Launch Instance" button on the bottom right of your window, as seen below: [[File:Launch.png|850px]] You will be taken back to the Instances Summary page and you should see your new instance launching. After a bit, your instance will change from "Spawning" to "Running". This means the instance is now booting, and it should finish booting in a minute or two. In the meantime, we will attach a "Floating IP" address to your instance so that you can SSH into it. On the right side of your running instance, you should see a drop-down menu; usually the "Create Snapshot" option is pre-selected. Click the drop-down menu arrow to open the menu and select "Associate Floating IP". In the "Associate Floating IP" dialogue, click the drop-down menu to see if any IP addresses are already available, and if so, select one. If none are available, click the little "+" button to the right to allocate a floating IP address.
It will ask you what Pool to use; select "ext-net". You can put in a description if you want, but most folks leave that field blank. Then click "Allocate IP". It will take you back one menu level. There will be a "Port to be Associated" field; just leave the default that is already there. Click the blue "Associate" button on the bottom right of the window. You will be returned to the "Instances Summary" page. You will see your instance running, and it should now list a "Floating IP". That is the IP you will use to SSH to the instance.

==Connect to Your New Instance==

Now that your instance is up and running, let's SSH to it and get going! '''From the computer you created your SSH keys on,''' SSH to your instance using a username matching the OS type you chose (ubuntu, centos, etc.) and the Floating IP address your instance has. '''You must be connected to the VPN for this to work!''' Example:

 $ ssh ubuntu@10.50.100.67

If you launched a CentOS instance, it would instead be "ssh centos@10.50.100.67", as appropriate. Assuming everything went as planned, you will be logged into your new Linux instance as the "ubuntu" or "centos" user, which is an unprivileged user. You do, however, have full sudo rights to do whatever administration you need to do. If you get a "Connection Refused" error when trying to SSH in, it means your instance isn't quite through launching yet; try again in about 30 seconds. At this point it is assumed you have a few systems administration skills under your belt, or at least have some time to query Google on how to perform various Linux tasks as necessary. Your instance has full outbound access to the greater Internet, so you can download things from the Internet, run "apt-get install" or "yum update", or whatever is appropriate. You can then install any software you need to get your work done.
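If you would rather not re-run ssh by hand while the instance finishes booting, a small shell loop can retry for you. A sketch assuming a POSIX shell ("retry" is a made-up helper, and the IP below is the hypothetical example address from above):

```shell
# retry: run a command until it succeeds, up to a maximum number of attempts,
# sleeping one second between tries.
retry() {
  attempts=$1
  shift
  n=0
  until "$@"; do
    n=$((n+1))
    [ "$n" -ge "$attempts" ] && return 1
    sleep 1
  done
}

# Typical use, with a hypothetical floating IP:
#   retry 10 ssh ubuntu@10.50.100.67 true
```

This simply keeps re-running the command, so it works just as well for any other command that may transiently fail.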
'''NOTE:''' You are the Systems Administrator of your instance - we cannot support questions on how to administer Linux for you. If OpenStack itself is having issues then please let us know, but please defer questions like "How do I install software on Ubuntu" to Google searches.

==Storage on Your New Instance==

Most of the storage on your new instance will be located in the /mnt directory, as seen by a "df -h" command on your instance:

 ubuntu@erich1:~$ df -h
 Filesystem      Size  Used Avail Use% Mounted on
 udev             16G     0   16G   0% /dev
 tmpfs           3.2G  676K  3.2G   1% /run
 /dev/vda1        20G  975M   19G   5% /
 tmpfs            16G     0   16G   0% /dev/shm
 tmpfs           5.0M     0  5.0M   0% /run/lock
 tmpfs            16G     0   16G   0% /sys/fs/cgroup
 /dev/vda15      105M  3.4M  102M   4% /boot/efi
 /dev/vdb1       1.0T  1.1G 1023G   1% /mnt
 tmpfs           3.2G     0  3.2G   0% /run/user/1000

Notice that "/mnt" has 1 TB of disk space, so store all your big, important data in /mnt. Avoid storing data on "/" whenever possible to prevent the root filesystem from filling up. The exact amount of storage available will depend on the flavor you chose when creating the instance.

==Instance Control Options==

Just a few notes on controlling your instances. They are fully functioning Linux machines, so a "sudo reboot" will reboot the machine, "sudo poweroff" will shut it down, etc. In cloud parlance, "Shut Down" means the instance is still there but the power is off. "Terminated" means it is fully deleted and unrecoverable, so be sure you want to delete your instance before you do so. We do not back instances up. We also have no access to your instance, so we cannot log in and see what's going on. You can control your instance in several ways from the OpenStack web interface, on the Instance Summary page. On the right side of your instance in the list will be that little drop-down menu. Options of interest are:

'''1: Create Snapshot''' Never use this option, as we have not implemented snapshotting in this environment.
'''2: View Log''' This will show you the boot/console log of the instance, so you can see if anything is causing issues.

'''3: Hard Reboot Instance''' This will hard reboot your instance, kind of like hitting the power button to power the instance off; it will power back on moments later. Useful if your instance is hosed because of a software crash or something else that has wedged it.

'''4: Delete Instance''' This will permanently destroy your instance. It will be deleted and is unrecoverable. It will also free up the resources it was using so that others can use them. This is useful if the group quotas have been reached and some old instances need to be cleaned out to make room for new ones.

'''5: Start Instance''' This option will be available if the instance is in the "Shut Down" state. Invoking it will boot the instance back up.

Do not use the other options you may see there; most have not been implemented in our deployment of OpenStack.

==Changing Your OpenStack Web Interface Password==

Once you have logged in to the web interface, you can change your password as follows. On the top right of the OpenStack web interface, you should see a little icon with your username on it. Click that icon to expand the drop-down menu and select "Settings". Then, in the next window, on the left navigation bar, you should see the "Change Password" button. Complete the Change Password dialogue to change your password. You may have to log in again after changing your password.

==Networking==

Your instances are connected at 10 Gb/s to each other and the Internet. Of course, actual transfer speeds will vary based on disk speed, the speed of the machine you are transferring data to or from, and other factors. Your instance is located in a private network that can only be seen by other instances in your group.
Other OpenStack groups are logically separated into their own networks, and your instance cannot route to them. Also, no one can access your instance unless they have a VPN account with us, so your instances are completely fenced off from inbound connections from the greater Internet, which means you are largely secure against script kiddies and hackers. You are still able to connect outbound from your instances.

==Etiquette==

There is one main thing to remember when using instances in OpenStack. When you create an instance, it uses CPU and RAM and, most importantly, it pins disk space for that instance. If you use up all the disk, CPU, and RAM quota for your group, then others have no resources left to create their own instances. The best plan of action is to fire up your VM, keep it up while you need it, then copy your data off and delete the instance. Document the steps taken to create your instance so that you could do it again if you needed to. If the physical node that your instance resides on blows up, then your instance is lost forever, and we have no backups, so it is up to you to back up important data. It is also not good form to spin up an instance, store data there, and then not log in for months at a time; then you are pinning resources that others may need for urgent work. Try to be a good neighbor!
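Since disk space is the scarcest shared resource, it also helps to keep an eye on how full your instance's filesystems are getting. A small sketch that parses the "Use%" column of POSIX `df -P` output ("usage_pct" is a hypothetical helper, not an existing tool):

```shell
# usage_pct: print the percent-used figure (without the % sign) for the
# filesystem containing the given path, taken from `df -P` output.
usage_pct() {
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Typical use on an instance:
#   [ "$(usage_pct /mnt)" -gt 80 ] && echo "time to clean up /mnt"
```

This is handy in a cron job or login script as a reminder to copy data off and free space before things fill up.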
AWS Account List and Numbers

This is a list of our currently available AWS accounts and their account numbers:

 ucsc-bd2k : 862902209576
 ucsc-toil-dev : 318423852362
 ucsc-vg-dev : 781907127277
 ucsc-platform-dev : 719818754276
 comparative-genomics-dev : 162786355865
 nanopore-dev : 270442831226
 ucsc-cgp-production : 097093801910
 platform-hca-dev : 122796619775
 anvil-dev : 608666466534
 gi-gateway : 652235167018
 pangenomics : 422448306679
 braingeneers : 443872533066
 ucsctreehouse : 238605363322
 ucsc-bisti-dev : 851631505710
 dockstore-dev : 635220370222
 ucsc-spatial : 541180793903
 platform-hca-prod : 542754589326
 miga-lab : 156518225147

Computational Genomics Kubernetes Installation

__TOC__ The Computational Genomics Group has a Kubernetes cluster running on several large instances in AWS. The current cluster makeup includes three worker nodes, each with the following specs:
* 96 CPU cores (3.1 GHz)
* 384 GB RAM
* 3.3 TB Local NVMe Flash Storage
* 25 Gb/s Network Interface

==Getting Authorized to Connect==

If you require access to this Kubernetes cluster, contact Benedict Paten asking for permission to use it, then pass on that permission via email to: cluster-admin@soe.ucsc.edu Let us know which group you are with and we can authorize you to use the cluster in the correct namespace.

==Authenticating to Kubernetes==

We will authorize (authz) you to use the cluster on the server side, but you will also need to authenticate (authn) using your '@ucsc.edu' email address and a unique JSON Web Token. These credentials are installed in ~/.kube/config on whatever machine you are connecting to the cluster from. To authenticate and get your base Kubernetes configuration, go to the URL below, which will ask you to authenticate to Google. Use your '@ucsc.edu' email address as the login. It will then ask you to authenticate via CruzID Gold if your web browser doesn't already have the authentication token cached: https://cg-kube-auth.gi.ucsc.edu Once you authenticate (via username/password and 2-factor auth for CruzID Gold), it will pass you back to the 'https://cg-kube-auth.gi.ucsc.edu' website, which should confirm authentication at the top with a message saying "Successfully Authenticated".
'''If you see any errors in red''' but are sure you typed your password and 2-factor auth correctly, click the link again (https://cg-kube-auth.gi.ucsc.edu) and authenticate a second time, which should work. There is a quirk where the web token doesn't always pass back to us correctly on the first try. Upon success, you will be able to click the blue "Download Config File" button, which downloads your initial Kubernetes config file. Copy this file to your home directory as ~/.kube/config. Follow the directions on the web page to insert your '''"namespace:"''' line as directed. We will let you know which namespace to use.

==Testing Connectivity==

Once your ~/.kube/config file is set up correctly, you should be able to connect to the cluster. All our shared servers here at the Genomics Institute have the 'kubectl' command installed on them, but if you are coming from somewhere else, make sure the "kubectl" utility is installed on that machine. A quick test should go as follows:

 $ kubectl get nodes
 NAME          STATUS   ROLES    AGE   VERSION
 k1.kube       Ready    <none>   13h   v1.15.3
 k2.kube       Ready    <none>   13h   v1.15.3
 k3.kube       Ready    <none>   13h   v1.15.3
 master.kube   Ready    master   13h   v1.15.3

==Running Pods and Jobs with Requests and Limits==

When running jobs and pods on Kubernetes, you will always want to specify "requests" and "limits" on resources; otherwise your pods will get stuck with the default limits, which are tiny (to protect against runaway pods). You should always have an idea of how many resources your jobs will consume, and not request much more than that, in order not to hog all the resources. It also prevents your job from "running away" unexpectedly and chewing up more resources than expected.
Here is a good example of a job file that specifies limits:

job.yml:

 apiVersion: batch/v1
 kind: Job
 metadata:
   name: $USER-$TS
 spec:
   backoffLimit: 0
   ttlSecondsAfterFinished: 30
   template:
     spec:
       containers:
       - name: magic
         image: robcurrie/ubuntu
         imagePullPolicy: Always
         resources:
           requests:
             cpu: "1"
             memory: "2G"
             ephemeral-storage: "2G"
           limits:
             cpu: "1"
             memory: "2G"
             ephemeral-storage: "3G"
         command: ["/bin/bash", "-c"]
         args: ['for i in {1..100}; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done']
       restartPolicy: Never
       priorityClassName: medium-priority

Please note that the "requests" and "limits" fields should be the same. You would think that you could set the limit higher than the request, but in reality they need to match in order for the pod to stay within the Kubernetes resource limit bubble. If you set the limit higher than the request, you risk the pod using more memory than the scheduler expects, and the node can start killing off random other colocated pods by way of the OOM killer, which is very, very bad for the cluster. If you omit the "requests" section altogether, the limit values will be used, so if you use only one, use "limits". Also note the "priorityClassName" line. Available values are:

 high-priority
 medium-priority
 low-priority

That affects how quickly your jobs move up the queue in the event there are a lot of queued jobs. Always use "medium-priority" as the default unless you specifically know you need it higher or lower. Higher-priority jobs will always go in front of lower-priority jobs.

'''NOTE:''' Jobs and pods that '''completed''' over 72 hours ago but have not been cleaned up will be automatically removed by the garbage collector.
Most jobs will have the "ttlSecondsAfterFinished" configuration item in them, so they will be automatically cleaned up after that time expires, but leaving old pods and jobs around pins the disk space they were using for as long as they remain, so it is good to get rid of them as soon as they are done, unless you are debugging a failure or something like that. Jobs that '''run''' over 72 hours will not be deleted; only those that have '''exited''' over 72 hours ago. A lot of other good information can be found on Rob Currie's GitHub page, which includes examples and some "How To" documentation: https://github.com/rcurrie/kubernetes

==Using Amazon S3==

To use S3 from a Kubernetes pod, the pod needs to have the "aws" command installed, and it needs to have the ~/.aws/credentials file, with the credentials granting access, mounted in from a secret. Depending on your namespace, credentials may already be available in a "shared-s3-credentials" secret. If not, you can make a file called "credentials", populate it, and use "kubectl create secret generic secret-name --from-file credentials" to make an appropriate secret. Be sure to use AWS credentials that don't require assuming a role or MFA authentication! Here's a minimal example job YAML that demonstrates using S3:

 apiVersion: batch/v1
 kind: Job
 metadata:
   name: username-job
 spec:
   ttlSecondsAfterFinished: 1000
   template:
     spec:
       containers:
       - name: main
         imagePullPolicy: Always
         image: ubuntu:18.04
         command:
         - /bin/bash
         - -c
         - |
           set -e
           DEBIAN_FRONTEND=noninteractive apt-get update
           DEBIAN_FRONTEND=noninteractive apt-get install -y awscli
           aws s3 cp --no-progress s3://bucket/bigfile .
         volumeMounts:
         - mountPath: /root/.aws
           name: s3-credentials
         resources:
           limits:
             cpu: 1
             memory: "4Gi"
             ephemeral-storage: "10Gi"
       restartPolicy: Never
       volumes:
       - name: s3-credentials
         secret:
           secretName: shared-s3-credentials
   backoffLimit: 0

When copying files to and from S3, it is good to use the "--no-progress" option to "aws s3 cp".
It's not clever enough to notice that it isn't talking to a real terminal and suppress its progress bar, and the large number of bytes it emits to draw the progress bar can make it more difficult to inspect logs with k9s or "kubectl logs".

==Inlining Jobs in Shell and Shell in Jobs==

When interactively developing on Kubernetes, it can be useful to have a shell command you can copy and paste to run a Kubernetes job, rather than having to create YAML files on disk. Similarly, it can be useful to have shell scripting inline in your Kubernetes job definitions, rather than having to bake your experimental script into a Docker container. Here's an example that does both, putting the YAML inside a heredoc and putting the script to run in the container inside a multiline YAML string. We precede it with a command to delete the job, so you can modify your script and re-paste it to replace a failed or failing job. We also make sure to mount the AWS credentials in the container, so that the ''aws'' command will be able to access S3 if you install it.

 kubectl delete job username-job
 kubectl apply -f - <<'EOF'
 apiVersion: batch/v1
 kind: Job
 metadata:
   name: username-job
 spec:
   ttlSecondsAfterFinished: 1000
   template:
     spec:
       containers:
       - name: main
         imagePullPolicy: Always
         image: ubuntu:18.04
         command:
         - /bin/bash
         - -c
         - |
           set -e
           DEBIAN_FRONTEND=noninteractive apt-get update
           DEBIAN_FRONTEND=noninteractive apt-get install -y awscli cowsay
           cowsay "Listing files"
           aws s3 ls s3://vg-k8s/
         volumeMounts:
         - mountPath: /tmp
           name: scratch-volume
         - mountPath: /root/.aws
           name: s3-credentials
         resources:
           limits:
             cpu: 1
             memory: "4Gi"
             ephemeral-storage: "10Gi"
       restartPolicy: Never
       volumes:
       - name: scratch-volume
         emptyDir: {}
       - name: s3-credentials
         secret:
           secretName: shared-s3-credentials
   backoffLimit: 0
 EOF

Make sure to replace "username-job" with a unique job name that includes ''your'' username.
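Both of these job examples mount a secret at /root/.aws. The file stored in that secret is a standard AWS shared credentials file. A minimal sketch with placeholder values (never paste real keys into a wiki or commit them anywhere):

```ini
# ~/.aws/credentials -- placeholder values; substitute your own keys
[default]
aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
```

If your namespace has no "shared-s3-credentials" secret, this is the file you would create and load with the "kubectl create secret generic" command described above.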
==View the Cluster's Current Activity==

One quick way to check the cluster's utilization is:

 $ kubectl top nodes
 NAME          CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
 k1.kube       1815m        1%     1191Mi          0%
 k2.kube       51837m       53%    46507Mi         12%
 k3.kube       1458m        1%     61270Mi         15%
 master.kube   111m         5%     1024Mi          46%

This shows that the worker nodes k1, k2, and k3 are using minimal memory; k2 is using 53% CPU, but there is still lots of room for new jobs. Ignore the master node, as it only handles cluster management and doesn't run jobs or pods for users. Another good way to get details about the current state of the cluster is through the Kubernetes Dashboard: https://cgl-k8s-dashboard.gi.ucsc.edu/ Select the "token" login method and paste in this (long) token:

 eyJhbGciOiJSUzI1NiIsImtpZCI6IiJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJrdWJlcm5ldGVzLWRhc2hib2FyZC10b2tlbi0ycDY4cCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6ImE5ZGI2Y2I0LWMyYWYtNDM3My04ZmM2LWE4YWYwYTBmNGRkNCIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDprdWJlcm5ldGVzLWRhc2hib2FyZDprdWJlcm5ldGVzLWRhc2hib2FyZCJ9.ZVQjryG_ksfvReIfq4Frb6M4sE6OVDOXnFy9Aii-h3mrpdHRE6bgjdAvSGZ0jJSIUEz5GgPBQ0lCwhyZocivHHr4zTrNxMkOFZhPDnpvF6RVIDWTkqmH9Dg6qmro0gTJP75oKBpt7dFN2pW4zvqOAzqPmh7qxfoVusN8X6U13YirMFEf65-aGL-_FFNBsEzvjkC-BgXWbtk3YZc8CJL7xtvlKLyE6u6jC9Qx0SWnwzkALlxmzo_yYTDKpIrWiQGEqzLQOxKml-H0kSYLDX-t4sTivXp4vCw_ruoqwIpLnnQAC7q3ZtSTxHIrxbB7n_M8gfhpXtwprbPav-XmBk1xaQ

The dashboard is read-only, so you won't be able to edit anything; it's mostly for seeing what's going on and where.
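To eyeball total load from that output without adding up each row by hand, you can total the CPU column. A small awk sketch ("sum_cpu_millicores" is a made-up helper name):

```shell
# sum_cpu_millicores: read `kubectl top nodes` text on stdin and total the
# CPU(cores) column, which kubectl reports in millicores (e.g. "1815m").
sum_cpu_millicores() {
  awk 'NR > 1 { gsub(/m/, "", $2); total += $2 } END { print total }'
}

# Typical use: kubectl top nodes | sum_cpu_millicores
```

For the sample output above, this prints 55221 millicores, i.e. about 55 cores in use across the whole cluster.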
You can also look at current resource consumption with our Ganglia cluster monitoring tool: https://ganglia.gi.ucsc.edu/ That website requires a username and password:

 username: genecats
 password: KiloKluster

That's mostly for keeping the script kiddies and bots from banging on it. Once you get in, you should see a drop-down menu near the top left of the screen near "Genomics Institute Grid". From that drop-down menu, select "CG Kubernetes Cluster". It will take you to a page detailing the current resource usage and activity on the nodes. This can be useful for seeing if anyone else is using the whole cluster, or just to get an idea of how many resources are available for your batch of jobs.

==Profiling with Perf==

You can use Linux's "perf" to profile your code on the Kubernetes cluster. Below is an example of a job that does so. You need to obtain a "perf" binary that matches the version of the kernel that the Kubernetes ''hosts'' are running, which most likely does not correspond to any version of "perf" available in the Ubuntu repositories. Here we download a binary previously uploaded to S3. Also, the Kubernetes hosts have '''Non-Uniform Memory Access (NUMA)''': some physical memory is "closer" to some physical cores than to other physical cores. The system is divided into '''NUMA nodes''', each containing some cores and some memory. Memory access from a node to its own memory is significantly faster than memory access to other nodes' memory. For consistent profiling, it is important to restrict your application to a single NUMA node if possible, with "numactl", so that all accesses are local to that NUMA node. If you don't do this, your application's performance will vary arbitrarily depending on whether and when threads are scheduled on the different NUMA nodes of the system.
apiVersion: batch/v1 kind: Job metadata: name: username-profiling spec: ttlSecondsAfterFinished: 1000 template: metadata: # Apply a lable saying that we use NUMA node 0 labels: usesnuma0: "Yes" spec: affinity: # Say that we should not schedule on the same node as any other pod with that label podAntiAffinity: requiredDuringSchedulingIgnoredDuringExecution: - labelSelector: matchExpressions: - key: usesnuma0 operator: In values: - "Yes" topologyKey: "kubernetes.io/hostname" containers: - name: main imagePullPolicy: Always image: ubuntu:18.04 command: - /bin/bash - -c - | set -e DEBIAN_FRONTEND=noninteractive apt-get update DEBIAN_FRONTEND=noninteractive apt-get install -y awscli numactl # Use this particular perf binary that matches the hosts' kernels # If it is missing or outdated, get a new one from Erich or cluster-admin aws s3 cp --no-progress s3://vg-k8s/users/adamnovak/projects/test/perf /usr/bin/perf chmod +x /usr/bin/perf # Do your work with perf here. # Use numactl to limit your code to NUMA node 0 for consistent memory access times volumeMounts: - mountPath: /tmp name: scratch-volume - mountPath: /root/.aws name: s3-credentials resources: limits: cpu: 24 # One NUMA node on our machines is 24 cores. memory: "150Gi" ephemeral-storage: "400Gi" restartPolicy: Never volumes: - name: scratch-volume emptyDir: {} - name: s3-credentials secret: secretName: shared-s3-credentials backoffLimit: 0 13f00f3638539758390fb1c28d887974b7982004 217 214 2020-09-08T20:59:10Z Weiler 3 /* View the Cluster's Current Activity */ wikitext text/x-wiki __TOC__ The Computational Genomics Group has a Kubernetes Cluster running on several large instances in AWS. 
The current cluster makeup includes three worker nodes, each with the following specs: * 96 CPU cores (3.1 GHz) * 384 GB RAM * 3.3 TB Local NVMe Flash Storage * 25 Gb/s Network Interface ==Getting Authorized to Connect== If you require access to this kubernetes cluster, contact Benedict Paten asking for permission to use it, then pass on that permission via email to: cluster-admin@soe.ucsc.edu Let us know which group you are with and we can authorize you to use the cluster in the correct namespace. ==Authenticating to Kubernetes== We will authorize (authz) you to use the cluster on the server side, but you will also need to authenticate (authn) using your '@ucsc.edu' email address and a unique Java Web token. These credentials are installed in ~/.kube/config in whatever machine you are coming from to get to the cluster. To authenticate and get your base kubernetes configuration, go to this URL (below), which will ask you to authenticate to Google. Use your '@ucsc.edu' email address as the login. It will then ask you to authenticate via CruzID Gold if your web browser doesn't already have the authentication token cached: https://cg-kube-auth.gi.ucsc.edu Once you authenticate (via username/password and 2-factor auth for CruzID Gold), it will pass you back to the 'https://cg-kube-auth.gi.ucsc.edu' website and it should confirm authentication on the top with a message saying "Successfully Authenticated". '''If you see any errors in red,''' but are sure you typed in your password and 2-factor auth correctly, click on the above link again (https://cg-kube-auth.gi.ucsc.edu) and authenticate a second time, which should work. There is a quirk where the web token doesn't always pass back to us correctly on the first try. Upon success, you will be able to click the blue "Download Config File" button, which contains your initial kubernetes config file. Copy this file to your home directory as ~/.kube/config. 
Follow the directions on the web page to insert your '''"namespace:"''' line as directed. We will let you know which namespace to use. ==Testing Connectivity== Once your ~/.kube/config file is set up correctly, you should be able to connect to the cluster. All our shared servers here at the Genomics Institute have the 'kubectl' command installed on them, but if you are coming from somewhere else, just make sure the "kubectl" utility is installed on that machine. A quick test should go as follows: $ kubectl get nodes NAME STATUS ROLES AGE VERSION k1.kube Ready <none> 13h v1.15.3 k2.kube Ready <none> 13h v1.15.3 k3.kube Ready <none> 13h v1.15.3 master.kube Ready master 13h v1.15.3 ==Running Pods and Jobs with Requests and Limits== When running jobs and pods on kubernetes, you will always want to specify "requests" and "limits" on resources, otherwise your pods will get stuck with the default limits which are tiny (to protect against runaway pods). You should always have an idea of how much resources will be consumed by your jobs, and not use much more than that, in order not to hog all the resources. It also prevents your job from "running away" unexpectedly and chewing up more resources than expected. Here is a good example of a job file that specifies limits: job.yml apiVersion: batch/v1 kind: Job metadata: name: $USER-$TS spec: backoffLimit: 0 ttlSecondsAfterFinished: 30 template: spec: containers: - name: magic image: robcurrie/ubuntu imagePullPolicy: Always resources: requests: cpu: "1" memory: "2G" ephemeral-storage: "2G" limits: cpu: "1" memory: "2G" ephemeral-storage: "3G" command: ["/bin/bash", "-c"] args: ['for i in {1..100}; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done'] restartPolicy: Never priorityClassName: medium-priority Please note that the "request" and "limit" item fields should be the same. 
You would think that you could set the limit higher than the request, but in reality they need to match in order for the pod to stay within the kubernetes resource limit bubble. If you set the limit higher than the request, then you are risking the pod using more memory than the scheduler expects, and the node can start killing off random other colocated pods by way of OOM, which is very, very bad for the cluster. If you omit the "requests" section altogether, the limit values will be used, so if you use only one, use "limits". Also note the "priorityClassName" line. Available values are: high-priority medium-priority low-priority That affects how quickly your jobs move up the queue in the event there are a lot of queued jobs. Always use "medium-priority" as the default unless you specifically know you need it higher or lower. Higher priority jobs will always go in front of lower priority jobs. '''NOTE:''' Jobs and pods that have '''completed''' over 72 hours ago but have not been cleaned up will be automatically removed by the garbage collector. Most jobs will have the "ttlSecondsAfterFinished" configuration item in them, so they will automatically cleaned up after that time expires, but leaving the old pods and jobs around pins the disk space they were using while they remain, so it's good to get rid of them as soon as they are done unless you are debugging a failure or something like that. Jobs that '''run''' over 72 hours will not be deleted, only the ones that have '''exited''' over 72 hours ago. A lot of other good information can be viewed on Rob Currie's github page, which includes examples and some "How To" documentation: https://github.com/rcurrie/kubernetes ==Using Amazon S3== To use S3 from a Kubernetes pod, the pod needs to have the "aws" command installed, and it needs to have the ~/.aws/credentials file, with the credentials granting access, mounted over from a secret. 
Depending on your namespace, credentials may already be available in a "shared-s3-credentials" secret. If not, you can make a file called "credentials", populate it, and use "kubectl create secret generic secret-name --from-file credentials" to make an appropriate secret. Be sure to use AWS credentials that don't require assuming a role or MFA authentication! Here's a minimal example job YAML that demonstrates using S3.

 apiVersion: batch/v1
 kind: Job
 metadata:
   name: username-job
 spec:
   ttlSecondsAfterFinished: 1000
   template:
     spec:
       containers:
       - name: main
         imagePullPolicy: Always
         image: ubuntu:18.04
         command:
         - /bin/bash
         - -c
         - |
           set -e
           DEBIAN_FRONTEND=noninteractive apt-get update
           DEBIAN_FRONTEND=noninteractive apt-get install -y awscli
           aws s3 cp --no-progress s3://bucket/bigfile .
         volumeMounts:
         - mountPath: /root/.aws
           name: s3-credentials
         resources:
           limits:
             cpu: 1
             memory: "4Gi"
             ephemeral-storage: "10Gi"
       restartPolicy: Never
       volumes:
       - name: s3-credentials
         secret:
           secretName: shared-s3-credentials
   backoffLimit: 0

When copying files to and from S3, it is good to use the "--no-progress" option to "aws s3 cp". It's not clever enough to notice that it isn't talking to a real terminal and suppress its progress bar, and the large number of bytes it emits to draw the progress bar can make it more difficult to inspect logs with k9s or "kubectl logs".

==Inlining Jobs in Shell and Shell in Jobs==

When interactively developing on Kubernetes, it can be useful to have a shell command you can copy and paste to run a Kubernetes job, rather than having to create YAML files on disk. Similarly, it can be useful to have shell scripting inline in your Kubernetes job definitions, rather than having to bake your experimental script into a Docker container. Here's an example that does both, putting the YAML inside a heredoc and putting the script to run in the container inside a multiline YAML string.
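One detail worth noting before the example: the heredoc delimiter is quoted (<<'EOF', not <<EOF), which stops your local shell from expanding variables, so the $-signs in the embedded script reach the cluster untouched. A quick local demonstration of the difference:

```shell
# Quoted delimiter: the local shell leaves $HOSTNAME alone.
cat <<'EOF' > quoted.txt
args: ['echo "$HOSTNAME"']
EOF
# Unquoted delimiter: the local shell expands $HOSTNAME before anything else sees it.
cat <<EOF > unquoted.txt
args: ['echo "$HOSTNAME"']
EOF
grep -c '[$]HOSTNAME' quoted.txt    # prints 1: the literal $HOSTNAME survived
```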
We precede this with a command to delete the job, so you can modify your script and re-paste it to replace a failed or failing job. We also make sure to mount the AWS credentials in the container, so that the ''aws'' command will be able to access S3 if you install it.

 kubectl delete job username-job
 kubectl apply -f - <<'EOF'
 apiVersion: batch/v1
 kind: Job
 metadata:
   name: username-job
 spec:
   ttlSecondsAfterFinished: 1000
   template:
     spec:
       containers:
       - name: main
         imagePullPolicy: Always
         image: ubuntu:18.04
         command:
         - /bin/bash
         - -c
         - |
           set -e
           DEBIAN_FRONTEND=noninteractive apt-get update
           DEBIAN_FRONTEND=noninteractive apt-get install -y awscli cowsay
           cowsay "Listing files"
           aws s3 ls s3://vg-k8s/
         volumeMounts:
         - mountPath: /tmp
           name: scratch-volume
         - mountPath: /root/.aws
           name: s3-credentials
         resources:
           limits:
             cpu: 1
             memory: "4Gi"
             ephemeral-storage: "10Gi"
       restartPolicy: Never
       volumes:
       - name: scratch-volume
         emptyDir: {}
       - name: s3-credentials
         secret:
           secretName: shared-s3-credentials
   backoffLimit: 0
 EOF

Make sure to replace "username-job" with a unique job name that includes ''your'' username.

==View the Cluster's Current Activity==

One quick way to check the cluster's utilization is to do:

 kubectl top nodes
 NAME          CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
 k1.kube       1815m        1%     1191Mi          0%
 k2.kube       51837m       53%    46507Mi         12%
 k3.kube       1458m        1%     61270Mi         15%
 master.kube   111m         5%     1024Mi          46%

That shows the worker nodes k1, k2, and k3 using minimal memory; k2 is using about 53% of its CPU, but there is still plenty of room for new jobs. Ignore the master node, as it only handles cluster management and doesn't run jobs or pods for users.
Another good way to get a lot of detail about the current state of the cluster is through the Kubernetes Dashboard:

https://cgl-k8s-dashboard.gi.ucsc.edu/

Select the "token" login method, and paste in this (long) token:

 eyJhbGciOiJSUzI1NiIsImtpZCI6InhhcTVJLWdkXzMzZzAxUENCdjNBYUJBbkZfZlBwSG9lVmd4S1dZbWZ6TncifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJrdWJlcm5ldGVzLWRhc2hib2FyZC10b2tlbi1zNGxkbiIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6ImYwNWU4NjYyLWUyY2QtNDY3Yy1hYjY3LTNjNDc4ODVjZmM4YSIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDprdWJlcm5ldGVzLWRhc2hib2FyZDprdWJlcm5ldGVzLWRhc2hib2FyZCJ9.pQQ3iaWgWGr4CVSxl0wvI9R0AqtxEMm_ElisnfejJtKb9g5ki7goL4VQ9n4lY1b0hO7ojfYcZzWC466FHLULPac6r_zRvme2YMi9EwyHU4iYfUVOktmcLPGl-NS_D3k-USJF8npqbn1OFSHS25pJ5924LFAC0dCkukanNODyNgbetplgkl8geG1pR_1dgqamJCB2xwDn2FjQBC-QjtUJnarGqeo1gqG3eeeWAImK3lGLnkYGPcsvwowmtOdjj2ScqCfjqlfkxWymMGAOB-iB7hEruYZ6dD4hrpIGuVSGQCHojm4FJo_AiFgRjBmfHZiRi0PV1PNoLQLRplpXMf2jOg

The dashboard is read-only, so you won't be able to edit anything; it's mostly for seeing what's going on and where.

You can also look at current resource consumption with our Ganglia cluster monitoring tool:

https://ganglia.gi.ucsc.edu/

That website requires a username and password:

 username: genecats
 password: KiloKluster

That's mostly for keeping the script kiddies and bots from banging on it. Once you get in, you should see a drop-down menu near the top left of the screen, near "Genomics Institute Grid". From the drop-down menu, select "CG Kubernetes Cluster". It will take you to a page detailing the current resource usage and activity on the nodes.
This can be useful for seeing whether anyone else is using the whole cluster, or just to get an idea of how many resources are available before you assign a batch of jobs to the cluster.

==Profiling with Perf==

You can use Linux's "perf" to profile your code on the Kubernetes cluster. Here is an example of a job that does so. You need to obtain a "perf" binary that matches the version of the kernel that the Kubernetes ''hosts'' are running, which most likely does not correspond to any version of "perf" available in the Ubuntu repositories. Here we download a binary previously uploaded to S3.

Also, the Kubernetes hosts have '''Non-Uniform Memory Access (NUMA)''': some physical memory is "closer" to some physical cores than to other physical cores. The system is divided into '''NUMA nodes''', each containing some cores and some memory. Memory access from a node to its own memory is significantly faster than memory access from a node to other nodes' memory. For consistent profiling, it is important to restrict your application to a single NUMA node if possible, with "numactl", so that all accesses are local to that NUMA node. If you don't do this, your application's performance will vary arbitrarily depending on whether and when its threads are scheduled on the different NUMA nodes of the system.
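The pinning step can be sketched as a small wrapper (the function name is made up for illustration; --cpunodebind and --membind are numactl's actual flags):

```shell
# Illustrative wrapper: run a command confined to NUMA node 0 when numactl
# is available, so CPU time and memory allocations all stay on that node.
run_on_numa0() {
  if command -v numactl >/dev/null 2>&1; then
    numactl --cpunodebind=0 --membind=0 "$@"
  else
    "$@"  # fall back gracefully on machines without numactl installed
  fi
}

run_on_numa0 echo "profiled workload goes here"
```

Inside a profiling job, you would wrap your actual perf invocation the same way.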
 apiVersion: batch/v1
 kind: Job
 metadata:
   name: username-profiling
 spec:
   ttlSecondsAfterFinished: 1000
   template:
     metadata:
       # Apply a label saying that we use NUMA node 0
       labels:
         usesnuma0: "Yes"
     spec:
       affinity:
         # Say that we should not schedule on the same node as any other pod with that label
         podAntiAffinity:
           requiredDuringSchedulingIgnoredDuringExecution:
           - labelSelector:
               matchExpressions:
               - key: usesnuma0
                 operator: In
                 values:
                 - "Yes"
             topologyKey: "kubernetes.io/hostname"
       containers:
       - name: main
         imagePullPolicy: Always
         image: ubuntu:18.04
         command:
         - /bin/bash
         - -c
         - |
           set -e
           DEBIAN_FRONTEND=noninteractive apt-get update
           DEBIAN_FRONTEND=noninteractive apt-get install -y awscli numactl
           # Use this particular perf binary that matches the hosts' kernels
           # If it is missing or outdated, get a new one from Erich or cluster-admin
           aws s3 cp --no-progress s3://vg-k8s/users/adamnovak/projects/test/perf /usr/bin/perf
           chmod +x /usr/bin/perf
           # Do your work with perf here.
           # Use numactl to limit your code to NUMA node 0 for consistent memory access times
         volumeMounts:
         - mountPath: /tmp
           name: scratch-volume
         - mountPath: /root/.aws
           name: s3-credentials
         resources:
           limits:
             cpu: 24  # One NUMA node on our machines is 24 cores.
             memory: "150Gi"
             ephemeral-storage: "400Gi"
       restartPolicy: Never
       volumes:
       - name: scratch-volume
         emptyDir: {}
       - name: s3-credentials
         secret:
           secretName: shared-s3-credentials
   backoffLimit: 0

=Genomics Institute Computing Information=

Welcome to the Genomics Institute Computing Information Repository! Browse the topics below for help in the area you are curious about.
==GI Public Computing Environment==
*[[How to access the public servers]]

==GI Firewalled Computing Environment (PRISM)==
*[[Access to the Firewalled Compute Servers]]

==VPN Access==
*[[Requirement for users to get GI VPN access]]

== NIH dbGaP Access Requirements ==
*[[Requirements for dbGaP Access]]

== giCloud Openstack ==
*[[Overview of giCloud in the Genomics Institute]]
*[[Quick Start Instructions to Get Rolling with OpenStack]]

== Amazon Web Services Information ==
*[[Overview of Getting and Using an AWS IAM Account]]
*[[AWS Account List and Numbers]]
*[[AWS Shared Bucket Usage Graphs]]
*[[AWS Best Practices]]

==Kubernetes Information==
*[[Computational Genomics Kubernetes Installation]]
*[[Undiagnosed Disease Project Kubernetes Installation]]

== Problems or technical support ==
If you have any problems with the GI computing environment, please send an email to '''cluster-admin@soe.ucsc.edu'''

=Requirement for users to get GI VPN access=

If you need VPN access to the Genomics Institute firewalled/secure area (aka the "Prism" or "CIRM" Environment), please complete this form to request access: https://app.smartsheet.com/b/form/a76dbd90ba0240ab9ea9d39b390586ce

There are several requirements for gaining access to the firewalled area - please complete all of them '''BEFORE''' coming to have the VPN software set up on your laptop. Please use this checklist to make sure that you have completed all '''six''' requirements.

'''1'''. User info, your PI info, and your PI's approval
'''2'''. NIH Public Security Refresher Course Certificate
'''3'''. Signed Genomics Institute VPN User Agreement
'''4'''. Signed NIH Genomic Data Sharing Policy Agreement
'''5'''. "eduroam" wireless network set up on your laptop
'''6'''.
Installed the appropriate OpenVPN software on your laptop

'''1''': You are required to ask your PI or sponsor to email cluster-admin@soe.ucsc.edu requesting a VPN account for you - this email should include:
*Your name
*Your PI's name
*Your requested username (if your name is Jane Doe, then your username could be 'jdoe', for example)
*Your PI's approval for this access
*Any other access you need, such as a UNIX server account or access to OpenStack

'''2''': You must take the NIH Public Security Refresher Course online, then print out the Completion Certificate (which should have your name on it) at the end of the training and deliver it when you come to your appointment to install the VPN software. You must complete the course in a single continuous sitting in order to be able to print the certificate at the end: https://irtsectraining.nih.gov/publicUser.aspx Click on the "2018 Information Security, Counterintelligence, Privacy Awareness, Records Management Refresher" link to begin the course. At the end you will be able to print out the completion certificate, which should have your name on it.

'''3''': You need to print and sign the Genomics Institute VPN User Agreement and bring it with you to your VPN software installation appointment. It is located here for download: [[Media:GI_VPN_Policy.pdf]]

'''4''': Please print, read, and sign the last page of the NIH Genomic Data Sharing Policy agreement, located here for download. Just staple the pages together, and bring the signed document to your appointment. By signing the document you agree that you have read and understand the policies described therein and that you agree to abide by them: [[Media:NIH_GDS_Policy.pdf]]

'''5''': You will need access to the "eduroam" wireless network '''prior''' to your appointment. Other UCSC wireless networks such as "cruznet" will not work with our VPN software, so please make sure your laptop works with eduroam before coming to your appointment.
Instructions on how to get on eduroam are detailed here: https://its.ucsc.edu/wireless/eduroam.html

When using the VPN software off-campus, it will usually work unless the wireless network you are on has restrictions preventing it from functioning. Some other universities have such restrictions (notably UCSF), but most other university and home wireless networks should work fine.

'''6''': Before your appointment, please make sure you install the appropriate OpenVPN software on your laptop (running OS X, Windows, or Ubuntu):
*For Macs, please download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the latest stable version.
*For Windows, please download and install the '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. Select ''openvpn-install-x.x.x-xxxx.exe''
*For Ubuntu, please install network-manager-openvpn by typing: sudo apt-get install network-manager-openvpn network-manager-openvpn-gnome

Please do NOT worry about how to configure the software at this point. We will help you set it up at your appointment, and we will correspond with you via email about when the appointment will be. The appointment can take up to 30 minutes per person, depending on whether any issues come up during the software setup. If you show up for your appointment without one (or more) of the requirements outlined above, we will have to reschedule your appointment for a time after you have completed them.

'''PLEASE NOTE:''' Because of the overhead required in setting up VPN access, please only request access if you have an immediate need to work on data that exists behind the firewall. We have had a fair number of people request access and go through the setup but then never use it. In other words, please do not request access because "one day you might need it", but because you '''do''' actually need it!
'''ALSO NOTE:''' VPN accounts expire after one year from the date of first gaining access. To renew for another year you will need your PI/sponsor to send us a note asking for renewal. c001438af3e71d8edf58dbe2b8496249e5529665 224 218 2020-12-09T18:52:43Z Haifang 1 wikitext text/x-wiki If you need VPN access to the Genomics Institute firewalled/secure area (aka the "Prism" or "CIRM" Environment), please complete this form: https://app.smartsheet.com/b/form/a76dbd90ba0240ab9ea9d39b390586ce. There are two parts in this process. 1. For the user, please fill in ALL required fields and submit. 2. For the Sponsor/PI - you will receive an email from Smartsheet. Please fill in all required fields and submit. We will receive your completed request and we will create your account and go over the details via a short zoom meeting with you. to request access. There are several requirements to gaining access to the firewalled area - please complete all these requirements '''BEFORE''' coming to have the VPN software set up for your laptop. Please use this checklist to make sure that you have completed all '''six''' requirements. '''1'''. User info, your PI info and your PI's approval '''2'''. NIH Public Security Refresher Course Certificate '''3'''. Signed Genomics Institute VPN User Agreement '''4'''. Signed NIH Genomic Data Sharing Policy Agreement '''5'''. "eduroam" wireless network has setup on your laptop '''6'''. Installed the appropriate OpenVPN software on your laptop '''1''': You are required to ask your PI or sponsor to email cluster-admin@soe.ucsc.edu requesting a VPN account for you - this email should include: Your name Your PI's name Your requested username (If your name is Jane Doe, then your username could be 'jdoe' for example). PI's approval for this access What other access you need such as a UNIX server account or access to OpenStack. 
'''2''': You must take the NIH Public Security Refresher Course online, then print out the Completion Certificate (which should have your name on it) at the end of the training and deliver it when you come to your appointment to install the VPN software. You must complete the course in a single continuous sitting in order to be able to print the certificate at the end: https://irtsectraining.nih.gov/publicUser.aspx Click on the "2018 Information Security, Counterintelligence, Privacy Awareness, Records Management Refresher" link to begin the course. At the end you will be able to print out the completion certificate that should have your name on it. '''3''': You need to print and sign the Genomics Institute VPN User Agreement and bring it with you to your VPN software installation appointment, located here for download: [[Media:GI_VPN_Policy.pdf]] '''4''': Please print, read and sign the last page of the NIH Genomic Data Sharing Policy agreement, located here for download. Just staple the pages together. Please bring the signed document to your appointment. By signing the document you agree that you have read and understand the policies described therein and that you agree to abide by those policies: [[Media:NIH_GDS_Policy.pdf]] '''5''': You will need access to the "eduroam" wireless network '''prior''' to your appointment. Other UCSC wireless networks such as "cruznet" will not work with our VPN software, so please make sure your laptop works with eduroam before coming to your appointment. Instructions on how to get on eduroam are detailed here: https://its.ucsc.edu/wireless/eduroam.html When using the VPN software off-campus, it will usually work unless the wireless network you are on has restrictions preventing it from functioning. Some other universities have such restrictions (notably UCSF), but most other wireless network and home wireless networks should work fine. 
'''6''': Before your appointment, please make sure you install the appropriate OpenVPN software on your laptop: A laptop running OS X, Windows or Ubuntu For Macs, please download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the Latest Stable version. For Windows, please download and install '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. Select ''openvpn-install-x.x.x-xxxx.exe'' For Ubuntu, please install network-manager-openvpn by typing: sudo apt-get install network-manager-openvpn network-manager-openvpn-gnome Please do NOT worry about how to configure the software at this point. We will help you to set it up at your appointment. We will correspond with you via email on when the appointment will be. The appointment can take up to 30 minutes per person depending on whether or not any issues come up during the software setup. If you show up for your appointment without one (or more) of the above outlined requirements, we will have to reschedule your appointment for a time when you can arrive after completing the above requirements. '''PLEASE NOTE:''' Because of the overhead required in setting up VPN access, please only request access if you have an immediate need to work on data that exists behind the firewall. We have had a decent number of people request access and go through the setup but then never use it. In other words, please do not request access because "one day you might need it", but because you '''do''' actually need it! '''ALSO NOTE:''' VPN accounts expire after one year from the date of first gaining access. To renew for another year you will need your PI/sponsor to send us a note asking for renewal. 
7358d7b74ac2da1bfa1eb02e4ab8c57847fd3700 225 224 2020-12-09T18:59:56Z Haifang 1 wikitext text/x-wiki If you need VPN access to the Genomics Institute firewalled/secure area (aka the "Prism" or "CIRM" Environment), please complete this form: https://app.smartsheet.com/b/form/a76dbd90ba0240ab9ea9d39b390586ce. There are two parts in this process. 1. For the user, please fill in ALL required fields and attach all three required documents. Please see the links and instructions below. 2. For the Sponsor/PI - you will receive an email from Smartsheet. Please fill in all required fields and submit. We will receive your completed request and we will create your account and go over the details via a short zoom meeting with you. to request access. There are several requirements to gaining access to the firewalled area - please complete all these requirements '''BEFORE''' coming to have the VPN software set up for your laptop. Please use this checklist to make sure that you have completed all '''six''' requirements. '''1'''. User info, your PI info and your PI's approval '''2'''. NIH Public Security Refresher Course Certificate '''3'''. Signed Genomics Institute VPN User Agreement '''4'''. Signed NIH Genomic Data Sharing Policy Agreement '''5'''. "eduroam" wireless network has setup on your laptop '''6'''. Installed the appropriate OpenVPN software on your laptop '''1''': You are required to ask your PI or sponsor to email cluster-admin@soe.ucsc.edu requesting a VPN account for you - this email should include: Your name Your PI's name Your requested username (If your name is Jane Doe, then your username could be 'jdoe' for example). PI's approval for this access What other access you need such as a UNIX server account or access to OpenStack. 
'''2''': You must take the NIH Public Security Refresher Course online, then print out the Completion Certificate (which should have your name on it) at the end of the training and deliver it when you come to your appointment to install the VPN software. You must complete the course in a single continuous sitting in order to be able to print the certificate at the end: https://irtsectraining.nih.gov/publicUser.aspx Click on the "2018 Information Security, Counterintelligence, Privacy Awareness, Records Management Refresher" link to begin the course. At the end you will be able to print out the completion certificate that should have your name on it. '''3''': You need to print and sign the Genomics Institute VPN User Agreement and bring it with you to your VPN software installation appointment, located here for download: [[Media:GI_VPN_Policy.pdf]] '''4''': Please print, read and sign the last page of the NIH Genomic Data Sharing Policy agreement, located here for download. Just staple the pages together. Please bring the signed document to your appointment. By signing the document you agree that you have read and understand the policies described therein and that you agree to abide by those policies: [[Media:NIH_GDS_Policy.pdf]] '''5''': You will need access to the "eduroam" wireless network '''prior''' to your appointment. Other UCSC wireless networks such as "cruznet" will not work with our VPN software, so please make sure your laptop works with eduroam before coming to your appointment. Instructions on how to get on eduroam are detailed here: https://its.ucsc.edu/wireless/eduroam.html When using the VPN software off-campus, it will usually work unless the wireless network you are on has restrictions preventing it from functioning. Some other universities have such restrictions (notably UCSF), but most other wireless network and home wireless networks should work fine. 
'''6''': Before your appointment, please make sure you install the appropriate OpenVPN software on your laptop: A laptop running OS X, Windows or Ubuntu For Macs, please download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the Latest Stable version. For Windows, please download and install '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. Select ''openvpn-install-x.x.x-xxxx.exe'' For Ubuntu, please install network-manager-openvpn by typing: sudo apt-get install network-manager-openvpn network-manager-openvpn-gnome Please do NOT worry about how to configure the software at this point. We will help you to set it up at your appointment. We will correspond with you via email on when the appointment will be. The appointment can take up to 30 minutes per person depending on whether or not any issues come up during the software setup. If you show up for your appointment without one (or more) of the above outlined requirements, we will have to reschedule your appointment for a time when you can arrive after completing the above requirements. '''PLEASE NOTE:''' Because of the overhead required in setting up VPN access, please only request access if you have an immediate need to work on data that exists behind the firewall. We have had a decent number of people request access and go through the setup but then never use it. In other words, please do not request access because "one day you might need it", but because you '''do''' actually need it! '''ALSO NOTE:''' VPN accounts expire after one year from the date of first gaining access. To renew for another year you will need your PI/sponsor to send us a note asking for renewal. 
221d18393fe6aad89b9b4b6d566254a3738b40e6 226 225 2020-12-09T19:04:35Z Haifang 1 wikitext text/x-wiki If you need VPN access to the Genomics Institute firewalled/secure area (aka the "Prism" or "CIRM" Environment), please complete this form: https://app.smartsheet.com/b/form/a76dbd90ba0240ab9ea9d39b390586ce. There are two parts in this process. 1. For the user, please fill in ALL required fields and attach all three required documents. Please see the links and instructions below. 2. For the Sponsor/PI - you will receive an email from Smartsheet. Please fill in all required fields and submit. We will receive your completed request and we will create your account and go over the details via a short zoom meeting with you. '''1''': You must take the NIH Public Security Refresher Course online, then print out the Completion Certificate (which should have your name on it) at the end of the training and deliver it when you come to your appointment to install the VPN software. You must complete the course in a single continuous sitting in order to be able to print the certificate at the end: https://irtsectraining.nih.gov/publicUser.aspx Click on the "2018 Information Security, Counterintelligence, Privacy Awareness, Records Management Refresher" link to begin the course. At the end you will be able to print out the completion certificate that should have your name on it. '''2''': You need to print and sign the Genomics Institute VPN User Agreement and bring it with you to your VPN software installation appointment, located here for download: [[Media:GI_VPN_Policy.pdf]] '''3''': Please print, read and sign the last page of the NIH Genomic Data Sharing Policy agreement, located here for download. Just staple the pages together. Please bring the signed document to your appointment. 
By signing the document you agree that you have read and understand the policies described therein and that you agree to abide by those policies: [[Media:NIH_GDS_Policy.pdf]] You will need access to the "eduroam" wireless network '''prior''' to your zoom appointment if you are on campus. Other UCSC wireless networks such as "cruznet" will not work with our VPN software, so please make sure your laptop works with eduroam before coming to your appointment. Instructions on how to get on eduroam are detailed here: https://its.ucsc.edu/wireless/eduroam.html When using the VPN software off-campus, it will usually work unless the wireless network you are on has restrictions preventing it from functioning. Some other universities have such restrictions (notably UCSF), but most other wireless network and home wireless networks should work fine. Before your appointment, please make sure you install the appropriate OpenVPN software on your laptop: A laptop running OS X, Windows or Ubuntu For Macs, please download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the Latest Stable version. For Windows, please download and install '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. Select ''openvpn-install-x.x.x-xxxx.exe'' For Ubuntu, please install network-manager-openvpn by typing: sudo apt-get install network-manager-openvpn network-manager-openvpn-gnome Please do NOT worry about how to configure the software at this point. We will help you to set it up at your appointment. We will correspond with you via email on when the appointment will be. The zoom meeting can take up to 30 minutes per person depending on whether or not any issues come up during the software setup. If you show up for your appointment without one (or more) of the above outlined requirements, we will have to reschedule your appointment for a time when you can arrive after completing the above requirements. 
'''PLEASE NOTE:''' Because of the overhead involved in setting up VPN access, please only request access if you have an immediate need to work on data that exists behind the firewall. A fair number of people have requested access and gone through the setup but then never used it. In other words, please do not request access because "one day you might need it", but because you '''do''' actually need it!

'''ALSO NOTE:''' VPN accounts expire one year from the date you first gain access. To renew for another year, you will need your PI/sponsor to send us a note asking for renewal.
= Overview of Getting and Using an AWS IAM Account =

__TOC__

== Getting AWS (Amazon Web Services) Access ==

The Genomics Institute has a number of AWS accounts that support different projects. If you become associated with one or more of those projects, you will need access to the corresponding account or accounts. We manage AWS IAM access through a single 'top level' account that everyone gets access to; once you log in there, you can "Switch Role" into the sub-account you are running things in.

To get access, you will need your PI or Project Manager to email cluster-admin (cluster-admin@soe.ucsc.edu) asking for an AWS account for you, naming in that email the projects you will have access to. The cluster-admin group will contact you with your login credentials. Once you log in, you can change your password if you want to, and you will be able to set up MFA (Multi-Factor Authentication) for your account. You will be required to use MFA in order to "Switch Role" into any of the sub-accounts for the projects you are working on. The login URL to use when logging in to the top level account is listed below.
The top level account is known as "gi-gateway":

 https://gi-gateway.signin.aws.amazon.com/console

When you log in, you '''may''' see a couple of error messages on the AWS dashboard saying you don't have access to view certain resources - '''this is normal''', so just ignore them.

== Configuring Account Credentials ==

Once you log in to gi-gateway, you will have very few permissions to do anything there - which is normal, since you will not be working in that account anyway. The gi-gateway account is just there to authenticate you to AWS.

'''Changing Your Password'''

You can change your password by clicking on your username at the top right of the browser window, just to the right of the little bell. If your username is melinda@ucsc.edu, for example:

* Click '''"melinda@ucsc.edu @ gi-gateway"''' at the top right of your browser window.
* Click the '''"My Security Credentials"''' drop-down menu option.
* Click the '''"Change Password"''' button to change your password.

Note that we have a password strength policy in place, so your password must conform to the following requirements:

* Your password must be at least 10 characters long
* Your password must contain at least one lowercase letter
* Your password must contain at least one non-alphanumeric character
* Your password must contain at least one number

You will also need to configure MFA on your account before you will be allowed to switch roles into another account.

'''Configuring MFA'''

The most common way to configure MFA (Multi-Factor Authentication) is to use '''Google Authenticator''', a free app available for Apple and Android phones and mobile devices; simply download it from the app store to get started. Other MFA apps may also work, but we have not tested everything out there.
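As an aside, the password requirements above can be checked locally before you submit a new password. The sketch below is our own hypothetical helper (not an AWS tool) that applies the four rules with grep:

```shell
# Sketch: check a candidate password against the gi-gateway policy above
# (>= 10 characters, one lowercase letter, one digit, one non-alphanumeric
# character). check_gi_password is a hypothetical local helper, not an AWS
# command.
check_gi_password() {
    p="$1"
    [ "${#p}" -ge 10 ] || return 1
    printf '%s' "$p" | grep -q '[a-z]' || return 1
    printf '%s' "$p" | grep -q '[0-9]' || return 1
    printf '%s' "$p" | grep -q '[^A-Za-z0-9]' || return 1
}

check_gi_password 'horse-battery-42' && echo "policy ok"
check_gi_password 'Short1!' || echo "rejected: too short"
```

Note that the policy does not require an uppercase letter, only length, lowercase, a digit, and a non-alphanumeric character.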
Once you have Google Authenticator installed, log into the gi-gateway account using the above URL, then:

* Click '''"melinda@ucsc.edu @ gi-gateway"''' at the top right of your browser window (again, melinda@ucsc.edu is an example).
* Click the '''"My Security Credentials"''' drop-down menu option.
* Scroll down to the MFA (Multi-Factor Authentication) section of the page, and click '''"Assign MFA Device"'''.
* In the following menu, select '''"Virtual MFA Device"'''.
* In the following window, click the '''"Show QR Code"''' link, and the MFA QR barcode will appear on your screen.
* Open the Google Authenticator app on your mobile device, and tap the little "+" symbol in the top right corner of the app to add an account.
* Select "Scan Barcode" in the Google Authenticator app to continue, and aim your mobile device's camera at the QR barcode.
* The new MFA device should then be set up, and you should see a six-digit number with a small timer to the right of it. Type the six-digit code it displays into your web browser when asked, then wait for the next code to appear after the timer expires, and type that into the second field. It should then inform you that you have successfully associated an MFA device with your account.

Once you have associated an MFA device with the 'gi-gateway' account, '''log out''', then log back in. It will ask for your username and password, and then for your MFA code, which you can view by opening Google Authenticator and seeing what code it is displaying at that time. The code changes every 30 seconds or so. '''You must log out first and log back in using MFA in order to be able to switch roles!'''

== Switching Roles into Another AWS Account ==

Now that you have configured a password and enabled MFA, you will be allowed to "Switch Roles" into another account so that you can begin work there.
The first time you switch roles into an account it will ask you a few questions, but subsequently it will remember which roles you have access to, and they will become menu items you can click to quickly switch roles. Let's assume that you want to switch to the 'pangenomics' AWS account, and that you have already been granted access to do so by the cluster-admin group. After logging into the 'gi-gateway' account at the URL listed here (same as above):

 https://gi-gateway.signin.aws.amazon.com/console

Do the following to switch roles into the 'pangenomics' account (as an example):

* Click '''"melinda@ucsc.edu @ gi-gateway"''' at the top right of your browser window (again, melinda@ucsc.edu is an example).
* Click the '''"Switch Role"''' option in the drop-down menu.
* In the following menu it will ask about the role you will be assuming. In our example we will use the following:
** Account* = pangenomics
** Role* = developer
** Display Name = [leave blank, or use a short phrase]
** Color = [choose a color for this role]
* Then click the "Switch Role" button.

If all went well, you should be dumped into the 'pangenomics' account and identified in the top right corner of the page as '''"developer @ pangenomics"''', indicating your role and the account you are active in. You can then work as normal in that account. If you have not yet been given access to that role, you will receive an error message and not be allowed to switch roles.

'''NOTE:''' When you switch roles, it may dump you into a region you don't expect. Always verify the region you are in by looking at the top right of the web page - it will display your region there. Most of our resources exist in "Oregon" (us-west-2), but some items appear in other regions on a per-case basis.
If you wish to switch context back to the 'gi-gateway' account in order to manage something, or to switch to another role in another account, simply:

* Click '''"developer @ pangenomics"''' in the top right corner of the window.
* Select '''"Back to melinda@ucsc.edu"'''

You will then be sent back to the 'gi-gateway' context, where you can add another role to switch into, manage your credentials, and switch roles again.

== API Access and Secret Keys ==

If you require programmatic access to AWS, you will very likely be familiar with the AWS concept of access keys and secret keys, which can be used by scripts to authenticate to AWS and use the APIs there without going through the web console. In the past, access keys and secret keys could be used with no further authentication. This introduces a security risk, so those keys must be carefully guarded - if anyone gets your keys, they can rack up charges on your AWS account without your knowledge!

With the "Assume Role" mechanism we are now using, access keys and secret keys can still be created by users, '''but only while logged into the gi-gateway account'''. Do not try to create keys while you have "Switched Roles" into another account. Keys you create in the top level 'gi-gateway' account will work for you in any sub-account you have access to switch roles into. You will, however, need to do a little more configuration for your keys to work from a UNIX command line.

To set up your access and secret keys for the first time (again, logged into the 'gi-gateway' account only), follow these instructions. Once you log into the gi-gateway web interface, click on your username in the top right corner of the browser window, then click "My Security Credentials". On that screen you will see an "Access Keys" section, with one key listed. Delete that key (using the "Delete" button on the right side of the key), then create a new key using the "Create Access Key" button.
It will show you your access and secret key '''once''', so make sure to copy and paste them somewhere. Note that we recommend awscli version 1.16.187 or later, as earlier versions have documented issues with profiles and MFA-related actions. You can determine your version of awscli by doing:

 aws --version

=== Entering Base Credentials ===

If you plan on using keys for API access, you will minimally need to configure the "aws" utility and then tweak the config a bit for our setup. To start, run "aws configure". It should look something like this (put in the access and secret keys that you created in the previous step):

 $ aws configure
 AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
 AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
 Default region name [None]: us-west-2
 Default output format [None]:

Most folks do that to start. It creates two files:

 ~/.aws/config
 ~/.aws/credentials

Those two files are important for accessing AWS via the 'aws' command.

'''~/.aws/credentials'''

This file contains your access key and secret key, and should not need to be modified after running 'aws configure'. The same keys can be used to access any roles in any accounts you have access to.

'''~/.aws/config'''

This file contains account information you will need to tweak. There are a few ways you could set it up.

=== Adjusting Configuration for Toil or a Single Role ===

If you usually use a single role for a single project, or if you need to use Toil with a particular role, configure it like this, so that the role is automatically assumed for every operation by default:

 [default]
 region = us-west-2
 source_profile = default
 role_arn = arn:aws:iam::422448306679:role/developer
 mfa_serial = arn:aws:iam::652235167018:mfa/melinda@ucsc.edu
 duration_seconds = 43200

The "role_arn" line contains the role and account number you are accessing.
You can see a list of live account numbers here: [[AWS Account List and Numbers]]

Find the account number you need and enter it on the role_arn line, along with the role name. You will get the role name from the cluster-admin group when you get access. The 'mfa_serial' line contains the identifier for your MFA device. It will always look like '''"arn:aws:iam::652235167018:mfa/[your_iam_username]"'''. The account number there is always "652235167018", because that is the account number of the top level "gi-gateway" account.

The "duration_seconds" parameter sets your session token's lifetime to 43200 seconds (12 hours), which is the maximum you can request (you may specify less). That means you will only have to authenticate with MFA once every 12 hours, rather than on every command.

Once that is configured, you should be able to use the aws command without any profile specified, and have it automatically assume a role to grant you access:

 $ aws s3 ls

It will ask you for your MFA code and then run the command. Once you enter the MFA code, the token it creates will be valid for 12 hours if you specified "duration_seconds = 43200" (or for one hour, the default session duration, if you omitted that line), so you can run other 'aws' CLI commands without re-authenticating with MFA for the duration of the session. After the session expires, you will need to authenticate via MFA again.
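When editing role_arn lines, it can help to double-check which account a given ARN points at. ARNs are colon-delimited, so cut can pull the fields apart; the values below are the document's example ARN and duration:

```shell
# Pull the account number and role out of a role ARN. ARN fields are
# colon-separated, with the account number in field 5 and the role in
# field 6. The ARN here is the example from the config above.
arn='arn:aws:iam::422448306679:role/developer'
echo "$arn" | cut -d: -f5    # account number: 422448306679
echo "$arn" | cut -d: -f6    # role: role/developer

# And duration_seconds = 43200 works out to:
echo "$((43200 / 3600)) hours"
```

Comparing field 5 against the [[AWS Account List and Numbers]] page is a quick way to confirm a profile points at the account you intend.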
=== Adjusting Configuration for Multiple Roles ===

If you have multiple roles that you use equally often, and you don't need to use Toil, you can configure multiple profiles, something like this:

 [default]
 region = us-west-2
 
 [profile pangenomics-developer]
 source_profile = default
 role_arn = arn:aws:iam::422448306679:role/developer
 mfa_serial = arn:aws:iam::652235167018:mfa/melinda@ucsc.edu
 duration_seconds = 43200

Once that is configured, you should be able to reference the profile you just created when using the aws command, like so:

 $ aws s3 ls --profile pangenomics-developer

== Tag Your Resources ==

When you start using AWS resources (instances, networks, etc.), it is very important that you "tag" your resources with the "Owner" tag (note the capital "O"). "Owner" is the key, and the value assigned to it is your IAM username (i.e. your email address). So, for example, if I spin up an instance, I would tag it during or after creation with something like:

 Owner = bob@ucsc.edu

If you do not tag your instances, '''they will automatically be terminated within 10 minutes.''' Tag your instances especially, but tag every resource you create! This allows us to perform accounting tasks much more easily and allows the Program Managers to know which resources are controlled by whom.

= How to access the public servers =

== How to Gain Access to the Public Genomics Institute Compute Servers ==

If you need access to the Genomics Institute compute servers, please complete this request form: https://app.smartsheet.com/b/form/a76dbd90ba0240ab9ea9d39b390586ce. There are two parts to this process:

1. The user fills in ALL required fields and submits the form.

2. The Sponsor/PI then receives an email from Smartsheet; please fill in all required fields and submit.

Once we receive your completed request, we will create your account and go over the details via a short Zoom meeting with you.

== Account and Storage Cost ==

Costs for having an active UNIX account and for storage (per TB) are listed in this document, under "Genomics Project Support", specifically "Genomics IT Systems User Support per user" and "Genomics Data Storage per TB": https://planning.ucsc.edu/budget/rates-and-assessments/recharge-rates/docs/2021-22-approved-recharge-rates.pdf

== Account Expiration ==

Your UNIX account will have an expiration date associated with it after creation.
If your account was created in January, February, or March, then your account expiration date will be July 1st of the '''current year'''. If the account was created after March, then your expiration date will be July 1st of the '''following year'''. You will receive notice by email when your account is about to expire. To renew, simply ask the PI who sponsored you (who will be named in the notice) to email '''cluster-admin@soe.ucsc.edu''' requesting that your account be renewed for another year. If your account expires, the account will be suspended and you will no longer be able to log in or view any data you may have on our systems. Any automated scripts (owned by you) that run via cron or other mechanisms will cease to function.

== Server Types and Management ==

You can log into our public compute servers via SSH:

* '''courtyard.gi.ucsc.edu''': 1TB RAM, 64 cores, 672GB local scratch space
* '''plaza.gi.ucsc.edu''': 1TB RAM, 64 cores, 672GB local scratch space

These servers run CentOS 7.5 Linux and are managed by the Genomics Institute Cluster Admin group. If you need software installed on them, please make your request by emailing cluster-admin@soe.ucsc.edu.

== Storage ==

These servers mount two types of storage: home directories and group storage directories. Your home directory is located at "/public/home/username" and has a 30GB quota. The group storage directories are created per PI, and each group directory has a 15TB quota. For example, if David Haussler is the PI you report to directly, the directory would exist as /public/groups/hausslerlab. Request access to that group directory and you will then be able to write to it. Each of those group directories is shared by the lab it belongs to, so you must be mindful of everyone's data usage and share the 15TB available per group accordingly.

On the compute servers you can check your group's current quota usage by using the '/usr/bin/viewquota' command.
You can only check the quota of a group you are a member of (you would be a member of the UNIX group of the same name). If you wanted to check the quota usage of /public/groups/hausslerlab, for example, you would do:

 $ viewquota hausslerlab
 Project quota on /export (/dev/mapper/export)
 Project ID   Used   Soft   Hard   Warn/Grace
 ----------   ----   ----   ----   ----------
 hausslerlab  1.8T   15T    16T    00 [------]

== Actually Doing Work and Computing ==

When doing research, running jobs, and the like, please be careful of your resource consumption on the server you are on. Don't run so many threads or processes at once that you overrun the available RAM or disk I/O. If you are not sure of your potential RAM, CPU, or disk impact, start small with one or two processes and work your way up from there. Also, before running your jobs, check what is already happening on the server by using the 'top' command to see who and what else is running and what resources are already being consumed. If, after starting a process, you realize that the server slows down considerably or becomes unusable, kill your processes and re-evaluate what you need to make things work. These servers are shared resources - be a good neighbor!

== Serving Files to the Public via the Web ==

If you want to set up a web page on courtyard, or serve files over HTTP from there, do this:

 mkdir /public/home/''your_username''/public_html
 chmod 755 /public/home/''your_username''/public_html

Put data in the public_html directory. The URL will be:

 http://public.gi.ucsc.edu/''~username''/

== /scratch Space on the Servers ==

Each server will generally have a local /scratch filesystem that you can use to store temporary files. '''BE ADVISED''' that /scratch is not backed up, and the data there could disappear in the event of a disk failure or anything else. Do not store important data there. If it is important, it should be moved somewhere else very soon after creation.
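The public_html steps above can be rehearsed anywhere before touching your real home directory. This sketch uses a temporary directory in place of /public/home/your_username; the 644 mode on files is an assumption here (the web server needs world-readable files as well as a world-searchable directory):

```shell
# Dry run of the public_html setup, using a temp dir in place of
# /public/home/<your_username> so it can be tested anywhere.
home_dir="$(mktemp -d)"

mkdir "$home_dir/public_html"
chmod 755 "$home_dir/public_html"   # directory must be world-readable/searchable

# Files inside should also be readable by the web server (assumed mode 644):
echo '<html>hello</html>' > "$home_dir/public_html/index.html"
chmod 644 "$home_dir/public_html/index.html"

# Verify the modes: prints 755 then 644
stat -c '%a' "$home_dir/public_html" "$home_dir/public_html/index.html"
```

Run the same mkdir/chmod pair against /public/home/''your_username'' on courtyard and the content should appear under http://public.gi.ucsc.edu/''~username''/.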
= Requirements for dbGaP Access =

If you need NIH dbGaP access, there are several requirements for gaining access - please complete all of these requirements '''BEFORE''' requesting dbGaP credentials.

NOTE: If you already have GI VPN access to the GI "Prism" environment, then you have already completed the requirements detailed below - let Haifang Telc (haifang@ucsc.edu) know and we can quickly move to getting you set up.

Please use this checklist to make sure that you have completed all '''three''' requirements:

'''1'''. Your PI's info and your PI's approval

'''2'''. NIH Public Security Refresher Course Certificate

'''3'''. Signed NIH Genomic Data Sharing Policy Agreement

'''1''': Ask your PI or sponsor to email '''Haifang Telc (haifang@ucsc.edu)''' requesting dbGaP access for you. This email should include:

* Your name
* Your PI's name
* The PI's approval for this access

'''2''': You must take the NIH Public Security Refresher Course online, then print out the Completion Certificate (which should have your name on it) at the end of the training and deliver it to the GI Grants Team. You must complete the course in a single continuous sitting in order to be able to print the certificate at the end: https://irtsectraining.nih.gov/publicUser.aspx

Click on the "2020 Information Security, Counterintelligence, Privacy Awareness, Records Management Refresher" link to begin the course. At the end you will be able to print out the completion certificate with your name on it.

'''3''': Please print and read the entire NIH Genomic Data Sharing Policy agreement (linked below for download), sign the last page of the document, then scan and email the executed document to haifang@ucsc.edu with a Subject Line that includes: NIH GDS document. By signing the document you agree that you have read and understood the policies described therein and that you agree to abide by those policies: [[Media:NIH_GDS_Policy.pdf]]
= AWS Shared Bucket Usage Graphs =

On this page are listed some pie charts indicating the usage breakdown of "shared" buckets on AWS, so as to give an idea of where the data is being used. The buckets are divided up by account.
[[vg-dev]] [http://logserv.gi.ucsc.edu/cgi-bin/vg-data.cgi?cmd=index&path=/s3fs/vg-data vg-data] bd311707388d1215b5fea6e67a3d83aa8f60f9da 232 231 2021-02-01T19:54:29Z Weiler 3 wikitext text/x-wiki On this page are listed some pie charts indicating the usage breakdown of "shared" buckets on AWS, so as to get an idea of where the data is being used. The buckets are divided up by account. <u>vg-dev</u> [http://logserv.gi.ucsc.edu/cgi-bin/vg-data.cgi?cmd=index&path=/s3fs/vg-data vg-data] 4ed79699e53357a3214ee936800f902455d3dffc 233 232 2021-02-01T19:55:46Z Weiler 3 wikitext text/x-wiki On this page are listed some pie charts indicating the usage breakdown of "shared" buckets on AWS, so as to get an idea of where the data is being used. The buckets are divided up by account. <u> == vg-dev == </u> [http://logserv.gi.ucsc.edu/cgi-bin/vg-data.cgi?cmd=index&path=/s3fs/vg-data vg-data] 02c9cb5e3827840657ae03c136f8a02c34c4cdaa 234 233 2021-02-01T19:56:16Z Weiler 3 wikitext text/x-wiki On this page are listed some pie charts indicating the usage breakdown of "shared" buckets on AWS, so as to get an idea of where the data is being used. The buckets are divided up by account. <u> == vg-dev Buckets == </u> [http://logserv.gi.ucsc.edu/cgi-bin/vg-data.cgi?cmd=index&path=/s3fs/vg-data vg-data] 5203f62ebb8c2babcde5dc6381b7b28bcf5ddcef AWS Best Practices 0 30 237 2021-08-23T19:13:15Z Weiler 3 Created page with "When using AWS, there are a few things to keep in mind in order to keep costs down: '''EC2''' [[Instances:]] When using instances, always pick and instance type that just q..." wikitext text/x-wiki When using AWS, there are a few things to keep in mind in order to keep costs down: '''EC2''' [[Instances:]] When using instances, always pick an instance type that just qualifies for what you need, and nothing much larger; otherwise you pay for wasted CPU time.
Also, shut down your instance as soon as you no longer actively need it, as instances that are shut down do not accrue cost. [[EBS Volumes:]] When you create an EBS volume, the volume accrues cost whether or not you have it attached to an instance, and whether or not it actually has data in it. Avoid using EBS volumes for long-term storage whenever possible, as EBS volumes are much more expensive per GB than S3 for storage. Also, if you need to spawn many instances for a short period of time, always try to use AWS "spot" instances to do the work when possible. Usually it takes a little waiting in order to successfully take advantage of spot instances, but the cost is 1/4 to 1/20 that of an on-demand instance. '''S3''' Remember that storing data in S3 costs money based on the amount of time the data spends in S3. If you don't plan on using the data in the near term, consider moving it to Glacier or Deep Glacier in order to save money on the storage. You can always pull the data back to regular S3 later if needed. '''Tagging''' It is extremely important to tag ''every single resource'' you use in AWS with the tag key "Owner", with the value being your IAM login name (i.e. your email address). Many EC2 and S3 resources will actually be deleted if not properly tagged, so make sure you do it as soon as you create a resource. Any resource, such as a Lambda, Elastic LoadBalancer, etc., can be tagged. This is so when cleanup time comes, we can see who owns what and ask around about the status of things in general. 99d0ae16c386d7f531810dcf911ef423bf90694e 238 237 2021-08-23T19:13:52Z Weiler 3 wikitext text/x-wiki When using AWS, there are a few things to keep in mind in order to keep costs down: '''EC2''' [[Instances:]] When using instances, always pick an instance type that just qualifies for what you need, and nothing much larger; otherwise you pay for wasted CPU time.
Also, shut down your instance as soon as you no longer actively need it, as instances that are shut down do not accrue cost. [[EBS Volumes:]] When you create an EBS volume, the volume accrues cost whether or not you have it attached to an instance, and whether or not it actually has data in it. Avoid using EBS volumes for long-term storage whenever possible, as EBS volumes are much more expensive per GB than S3 for storage. Also, if you need to spawn many instances for a short period of time, always try to use AWS "spot" instances to do the work when possible. Usually it takes a little waiting in order to successfully take advantage of spot instances, but the cost is 1/4 to 1/20 that of an on-demand instance. '''S3''' Remember that storing data in S3 costs money based on the amount of time the data spends in S3. If you don't plan on using the data in the near term, consider moving it to Glacier or Deep Glacier in order to save money on the storage. You can always pull the data back to regular S3 later if needed. '''Tagging''' It is extremely important to tag ''every single resource'' you use in AWS with the tag key "Owner", with the value being your IAM login name (i.e. your email address). Many EC2 and S3 resources will actually be deleted if not properly tagged, so make sure you do it as soon as you create a resource. Any resource, such as a Lambda, Elastic LoadBalancer, etc., can be tagged. This is so when cleanup time comes, we can see who owns what and ask around about the status of things in general. 3447af70ad5567d0f79bfd7838463a6236b65c92 239 238 2021-08-23T19:15:06Z Weiler 3 wikitext text/x-wiki When using AWS, there are a few things to keep in mind in order to keep costs down: '''EC2''' [[Instances:]] When using instances, always pick an instance type that just qualifies for what you need, and nothing much larger; otherwise you pay for wasted CPU time.
Also, shut down your instance as soon as you no longer actively need it, as instances that are shut down do not accrue cost. [[EBS Volumes:]] When you create an EBS volume, the volume accrues cost whether or not you have it attached to an instance, and whether or not it actually has data in it. Avoid using EBS volumes for long-term storage whenever possible, as EBS volumes are much more expensive per GB than S3 for storage. Also, if you need to spawn many instances for a short period of time, always try to use AWS "spot" instances to do the work when possible. Usually it takes a little waiting in order to successfully take advantage of spot instances, but the cost is 1/4 to 1/20 that of an on-demand instance. '''S3''' Remember that storing data in S3 costs money based on the amount of time the data spends in S3. If you don't plan on using the data in the near term, consider moving it to Glacier or Deep Glacier in order to save money on the storage. You can always pull the data back to regular S3 later if needed. '''Tagging''' It is extremely important to tag ''every single resource'' you use in AWS with the tag key "Owner", with the value being your IAM login name (i.e. your email address). Many EC2 and S3 resources will actually be deleted if not properly tagged, so make sure you do it as soon as you create a resource. Any resource, such as a Lambda, Elastic LoadBalancer, etc., can be tagged. This is so when cleanup time comes, we can see who owns what and ask around about the status of things in general.
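The Owner-tag rule above is easiest to satisfy at creation time. The sketch below is a minimal, hypothetical illustration using boto3's TagSpecifications format; the helper name, resource type, and email address are placeholders, not GI-provided tooling, and the actual API call is left commented out because it requires AWS credentials:

```python
def owner_tag_spec(resource_type, owner_email):
    """Build a boto3 TagSpecifications entry carrying the mandatory Owner tag."""
    return {
        "ResourceType": resource_type,
        "Tags": [{"Key": "Owner", "Value": owner_email}],
    }

spec = owner_tag_spec("instance", "someone@ucsc.edu")

# With credentials configured, the spec would be passed when launching:
# import boto3
# ec2 = boto3.client("ec2")
# ec2.run_instances(ImageId="ami-xxxxxxxx", InstanceType="t3.micro",
#                   MinCount=1, MaxCount=1, TagSpecifications=[spec])
print(spec["Tags"][0])
```

Tagging at launch (rather than afterwards) means the resource is never untagged, so it cannot be swept up by cleanup before you get around to labeling it.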
2f5a2b425838961ce7ef6841955521a9ea9fbaf3 Access to the Firewalled Compute Servers 0 17 242 139 2021-10-28T18:05:23Z Weiler 3 wikitext text/x-wiki Before you can access the firewalled environment (Prism), you must get VPN access to it, which is detailed here: [[Requirement for users to get GI VPN access]] == Account and Storage Cost == Costs for having an active UNIX account and for storage (per TB) are listed in this document, under "Genomics Project Support", specifically "Genomics IT Systems User Support per user" and "Genomics Data Storage per TB": https://planning.ucsc.edu/budget/rates-and-assessments/recharge-rates/docs/2021-22-approved-recharge-rates.pdf == Account Expiration == Your UNIX account will have an expiration date associated with it after creation. If your account was created in January, February or March, then your account expiration date will be July 1st of the '''current year'''. If the account was created after March, then your expiration date will be July 1st of the '''following year'''. You will receive notice by email when your account is about to expire. To renew, simply ask the PI that sponsored you (which will be included in the notice) to email '''cluster-admin@soe.ucsc.edu''' requesting that your account be renewed for another year. If your account expires, the account will be suspended and you will no longer be able to log in or view any data you may have in our systems. Any automated scripts (owned by you) that run via cron or other mechanisms will cease to function. == Server Types and Management == After confirming your VPN software is working, you can ssh into one of the compute servers behind the VPN:
* '''crimson.prism''': 256GB RAM, 32 cores, 5.5TB local scratch space
* '''razzmatazz.prism''': 256GB RAM, 32 cores, 5.5TB local scratch space
* '''mustard.prism''': 1.5TB RAM, 160 cores, 9TB local scratch space
These servers are running CentOS 7.5 Linux. They are managed by the Genomics Institute Cluster Admin group.
If you need software installed on either or both of these servers, please make your request by emailing cluster-admin@soe.ucsc.edu. == Storage == These servers mount two types of storage: home directories and group storage directories. Your home directory will be located at "/private/home/username" and has a 30GB quota. The group storage directories are created per PI, and each group directory has a 15TB quota. For example, if David Haussler is the PI that you report to directly, then the directory would exist as /private/groups/hausslerlab. Request access to that group directory and you will then be able to write to it. Each of those group directories is shared by the lab it belongs to, so you must be mindful of everyone's data usage and share the 15TB available per group accordingly. On the compute servers you can check your group's current quota usage by using the '/usr/bin/viewquota' command. You can only check the quota of a group you are part of (you would be a member of the UNIX group of the same name). If you wanted to check the quota usage of /private/groups/hausslerlab for example, you would do:
 $ viewquota hausslerlab
 Project quota on /export (/dev/mapper/export)
 Project ID    Used  Soft  Hard  Warn/Grace
 ----------- ---------------------------------
 hausslerlab  1.8T   15T   16T   00 [------]
== Actually Doing Work and Computing == When doing research, running jobs and the like, please be careful of your resource consumption on the server you are on. Don't run so many threads or processes at once that you overrun the available RAM or disk I/O. If you are not sure of your potential RAM, CPU or disk impact, start small with one or two processes and work your way up from there. Also, before running your jobs, use the 'top' command to see who and what else is already running on the server and what resources are already being consumed.
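To make the pre-flight check described above repeatable, a short script can compare the current 1-minute load average against the core count before you launch new work. This is an illustrative helper, not a GI-provided tool:

```python
import os

def headroom():
    """Return the 1-minute load average and the core count of this host."""
    load1, _, _ = os.getloadavg()
    return load1, os.cpu_count()

load1, cores = headroom()
# If the 1-minute load is already close to the core count, the server is
# busy; start with one or two processes instead of saturating it further.
print(f"1-minute load {load1:.2f} across {cores} cores")
```

This complements 'top' rather than replacing it: load average hides memory pressure, so still check RAM usage before starting large jobs.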
If, after starting a process, you realize that the server slows down considerably or becomes unusable, kill your processes and re-evaluate what you need to make things work. These servers are shared resources - be a good neighbor! == The Firewall == All servers are behind a firewall in this environment, and as such, you must connect to the VPN in order to access them. They will not be accessible from the greater Internet without VPN. You will, however, be able to connect outbound from them to other servers on the Internet to copy data in, sync git repos, and the like; only inbound connections are blocked. All machines behind the firewall have the private domain name suffix of "*.prism". == /scratch Space on the Servers == Each server will generally have a local /scratch filesystem that you can use to store temporary files. '''BE ADVISED''' that /scratch is not backed up, and the data there could disappear in the event of a disk failure or other mishap. Do not store important data there. If it is important, it should be moved somewhere else very soon after creation. fe83073a724923be6e17a8926ae63044d4b24f91 AWS S3 Lifecycle Management 0 31 245 2021-12-02T22:55:43Z Righanse 5 Created page with "Test page" wikitext text/x-wiki Test page 9dd22c5b755ad18afcfc0a30b91a6628948fcc77 246 245 2021-12-02T23:28:14Z Righanse 5 wikitext text/x-wiki (This page is a work in progress. The policies defined below are still being adjusted, and are not actively deployed) ==AWS S3 Lifecycle Policy Overview== AWS S3 buckets can be configured with lifecycle policies. These policies allow for automatically changing the storage class of objects based on the last time they were accessed. AWS S3 objects are typically stored in the Standard storage class, which provides easy access, but has relatively high GB/month storage costs. Other storage classes, such as Infrequent Access and Glacier, are more suitable for objects that are rarely accessed.
These storage classes maintain a much lower GB/month cost as compared to the Standard S3 storage class. In order to reduce monthly S3 storage costs, the UCSC GI has implemented global S3 lifecycle policies that transition objects to cheaper storage classes if they have not been accessed recently. Specifically, AWS S3 objects are transitioned as follows: * After '''30''' days of inactivity, objects are transitioned to the '''Infrequent Access''' storage class. * After '''180''' days of inactivity, objects are transitioned to the '''Glacier''' storage class. 8ed0a2a3003660ef34d3e7c6e69e666e5f420c01 247 246 2021-12-02T23:43:42Z Righanse 5 wikitext text/x-wiki (This page is a work in progress. The policies defined below may still be adjusted, and are not actively deployed) ==AWS S3 Lifecycle Policy Overview== AWS S3 buckets can be configured with lifecycle policies. These policies allow for automatically changing the storage class of objects based on the last time they were accessed. AWS S3 objects are typically stored in the Standard storage class, which provides easy access, but has relatively high GB/month storage costs. Other storage classes, such as Infrequent Access and Glacier, are more suitable for objects that are rarely accessed. These storage classes maintain a much lower GB/month cost as compared to the Standard S3 storage class, but also incur charges for access and retrieval. In order to reduce monthly S3 storage costs, the UCSC GI has implemented global S3 lifecycle policies that transition objects to cheaper storage classes if they have not been accessed recently. Specifically, AWS S3 objects are transitioned as follows: * After '''30''' days of inactivity, objects are transitioned to the '''Infrequent Access''' storage class. * After '''180''' days of inactivity, objects are transitioned to the '''Glacier''' storage class.
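The 30/180-day schedule above corresponds to a bucket lifecycle rule like the following. This is a minimal, hypothetical sketch in the payload format that boto3's put_bucket_lifecycle_configuration expects (the rule ID and bucket name are placeholders); note that native lifecycle transitions count days since object creation, not last access, so the "inactivity" semantics above are an approximation. The actual API call is commented out because it needs credentials:

```python
# Sketch of a lifecycle configuration matching the 30/180-day schedule.
lifecycle = {
    "Rules": [
        {
            "ID": "tier-down-cold-objects",  # illustrative rule name
            "Status": "Enabled",
            "Filter": {},  # empty filter: the rule applies to every object
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 180, "StorageClass": "GLACIER"},
            ],
        }
    ]
}

# With AWS credentials configured, the rule would be attached with:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-bucket", LifecycleConfiguration=lifecycle)
print(lifecycle["Rules"][0]["Transitions"])
```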
==Object Recovery== In case there are objects in S3 that have been transitioned out of the Standard storage class, they can be recovered. Due to the increased access charges of objects in Infrequent Access and Glacier, if an object is expected to be accessed frequently, it should be returned to the Standard storage class. The object recovery method depends on the storage class the object is in, with Glacier being more time consuming and challenging to restore from than Infrequent Access. It is important to note that, in the event an object is recovered, the timer for transitioning the object back to Infrequent Access and Glacier is still running, and the objects will be moved again in the future if they meet the criteria. ===Infrequent Access=== ===Glacier=== 09637d2e05210ecc3ef2e959ac49107d164aae05 248 247 2021-12-09T16:43:15Z Righanse 5 wikitext text/x-wiki (This page is a work in progress. The policies defined below may still be adjusted, and are not actively deployed) ==AWS S3 Lifecycle Policy Overview== AWS S3 buckets can be configured with lifecycle policies. These policies allow for automatically changing the storage class of objects based on the last time they were accessed. AWS S3 objects are typically stored in the Standard storage class, which provides easy access, but has relatively high GB/month storage costs. Other storage classes, such as Infrequent Access and Glacier are more suitable for objects that are rarely accessed. These storage classes maintain a much lower GB/month cost as compared to the Standard S3 storage class, but also incur charges for access and retrieval. It is recommended to utilize the appropriate storage classes for your data. * If you have data that you do not expect to access more than once a '''month''', AWS Infrequent Access is a reasonable storage class to use. * If you have data that you do not expect to access more than once a '''year''', AWS Glacier is a reasonable storage class to use. 
==UCSC GI Automated Policy== In order to reduce monthly S3 storage costs, the UCSC GI has implemented global S3 lifecycle policies that transition objects to AWS Intelligent Tiering, which monitors S3 object access patterns and transitions objects accordingly. * Objects uploaded to S3 will remain in the Standard storage class for '''1''' day, at which point they will be transitioned to Intelligent Tiering. * Old and new S3 buckets will have this lifecycle policy automatically attached. AWS Intelligent Tiering functionality: * Intelligent Tiering '''does not''' change object access patterns. This means you can still treat the object as if it were in the Standard storage class. * Intelligent Tiering '''does not''' incur charges for object retrieval from different tiers. For more details on AWS Intelligent Tiering, see the [https://aws.amazon.com/s3/storage-classes/intelligent-tiering/ AWS Docs] 1c260585db1fcd6fa3661f1dd5eff683be11da43 249 248 2021-12-09T17:03:07Z Righanse 5 wikitext text/x-wiki (This page is a work in progress. The policies defined below may still be adjusted, and are not actively deployed) ==AWS S3 Lifecycle Policy Overview== AWS S3 buckets can be configured with lifecycle policies. These policies allow for automatically changing the storage class of objects based on the last time they were accessed. AWS S3 objects are typically stored in the Standard storage class, which provides easy access, but has relatively high GB/month storage costs. Other storage classes, such as Infrequent Access and Glacier are more suitable for objects that are rarely accessed. These storage classes maintain a much lower GB/month cost as compared to the Standard S3 storage class, but also incur charges for access and retrieval. It is recommended to utilize the appropriate storage classes for your data. * If you have data that you do not expect to access more than once a '''month''', AWS Infrequent Access is a reasonable storage class to use. 
* If you have data that you do not expect to access more than once a '''year''', AWS Glacier is a reasonable storage class to use. ==UCSC GI Automated Policy== In order to reduce monthly S3 storage costs, the UCSC GI has implemented global S3 lifecycle policies that transition objects to AWS Intelligent Tiering, which monitors S3 object access patterns and transitions objects accordingly. * Objects uploaded to S3 will remain in the Standard storage class for '''1''' day, at which point they will be transitioned to Intelligent Tiering. * Old and new S3 buckets will have this lifecycle policy automatically attached. AWS Intelligent Tiering functionality: * Intelligent Tiering '''does not''' change object access patterns. This means you do not need to execute special API commands to access objects. * Intelligent Tiering '''does not''' incur charges for object retrieval from different tiers. For more details on AWS Intelligent Tiering, see the [https://aws.amazon.com/s3/storage-classes/intelligent-tiering/ AWS Docs] 947588956c9fcde8c485449587c769b4967d3119 250 249 2021-12-09T17:18:28Z Righanse 5 wikitext text/x-wiki (This page is a work in progress. The policies defined below may still be adjusted, and are not actively deployed) ==AWS S3 Lifecycle Policy Overview== AWS S3 buckets can be configured with lifecycle policies. These policies allow for automatically changing the storage class of objects based on the last time they were modified or accessed. AWS S3 objects are stored in the Standard storage class by default, which provides easy access, but has relatively high GB/month storage costs. Other storage classes, such as Infrequent Access and Glacier are more suitable for objects that are rarely accessed. These storage classes maintain a much lower GB/month cost as compared to the Standard S3 storage class, but also incur charges for access and retrieval. It is recommended to utilize the appropriate storage classes for your data. 
* If you have data that you do not expect to access more than once a '''month''', AWS Infrequent Access is a reasonable storage class to use. * If you have data that you do not expect to access more than once a '''year''', AWS Glacier is a reasonable storage class to use. ==UCSC GI Automated Policy== In order to reduce monthly S3 storage costs, the UCSC GI has implemented global S3 lifecycle policies that transition objects to AWS Intelligent Tiering, which monitors S3 object access patterns and transitions objects to more efficient storage classes accordingly. * Objects uploaded to S3 will remain in the Standard storage class for '''1''' day, at which point they will be transitioned to Intelligent Tiering. * Old and new S3 buckets will have this lifecycle policy automatically attached. AWS Intelligent Tiering functionality: * Intelligent Tiering '''does not''' change object access patterns. This means you do not need to execute special API commands to access objects. * Intelligent Tiering '''does not''' incur charges for object retrieval from different tiers. For more details on AWS Intelligent Tiering, see the [https://aws.amazon.com/s3/storage-classes/intelligent-tiering/ AWS Docs] 2340fd051043d68ae67a255ca6ead028701a561f 251 250 2021-12-09T17:19:48Z Righanse 5 wikitext text/x-wiki ==AWS S3 Lifecycle Policy Overview== AWS S3 buckets can be configured with lifecycle policies. These policies allow for automatically changing the storage class of objects based on the last time they were modified or accessed. AWS S3 objects are stored in the Standard storage class by default, which provides easy access, but has relatively high GB/month storage costs. Other storage classes, such as Infrequent Access and Glacier are more suitable for objects that are rarely accessed. These storage classes maintain a much lower GB/month cost as compared to the Standard S3 storage class, but also incur charges for access and retrieval. 
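The deployed policy described on this page (one day in Standard, then Intelligent-Tiering) can be written as a single lifecycle rule. A minimal, hypothetical sketch in the payload format boto3's put_bucket_lifecycle_configuration expects; the rule ID and bucket name are illustrative, and the GI attaches this policy automatically, so you should not normally need to do this yourself:

```python
# Sketch of the one-day transition from Standard to Intelligent-Tiering.
lifecycle = {
    "Rules": [
        {
            "ID": "standard-to-intelligent-tiering",  # illustrative rule name
            "Status": "Enabled",
            "Filter": {},  # no filter: applies to all objects in the bucket
            "Transitions": [
                {"Days": 1, "StorageClass": "INTELLIGENT_TIERING"},
            ],
        }
    ]
}

# With credentials configured, the rule would be attached with:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-bucket", LifecycleConfiguration=lifecycle)
print(lifecycle["Rules"][0]["Transitions"][0]["StorageClass"])
```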
It is recommended to utilize the appropriate storage classes for your data. * If you have data that you do not expect to access more than once a '''month''', AWS Infrequent Access is a reasonable storage class to use. * If you have data that you do not expect to access more than once a '''year''', AWS Glacier is a reasonable storage class to use. ==UCSC GI Automated Policy== In order to reduce monthly S3 storage costs, the UCSC GI has implemented global S3 lifecycle policies that transition objects to AWS Intelligent Tiering, which monitors S3 object access patterns and transitions objects to more efficient storage classes accordingly. * Objects uploaded to S3 will remain in the Standard storage class for '''1''' day, at which point they will be transitioned to Intelligent Tiering. * Old and new S3 buckets will have this lifecycle policy automatically attached. AWS Intelligent Tiering functionality: * Intelligent Tiering '''does not''' change object access patterns. This means you do not need to execute special API commands to access objects. * Intelligent Tiering '''does not''' incur charges for object retrieval from different tiers. For more details on AWS Intelligent Tiering, see the [https://aws.amazon.com/s3/storage-classes/intelligent-tiering/ AWS Docs] 71814a5d00158e08b021f0708d285cff936d1590 Genomics Institute Computing Information 0 6 252 236 2021-12-09T17:21:05Z Righanse 5 /* Amazon Web Services Information */ wikitext text/x-wiki Welcome to the Genomic Institute Computing Information Repository! Browse the below topics for help in the area you are curious about. 
==GI Public Computing Environment== *[[How to access the public servers]] ==GI Firewalled Computing Environment (PRISM)== *[[Access to the Firewalled Compute Servers]] ==VPN Access== *[[Requirement for users to get GI VPN access]] == NIH dbGaP Access Requirements == *[[Requirements for dbGaP Access]] == giCloud Openstack == *[[Overview of giCloud in the Genomics Institute]] *[[Quick Start Instructions to Get Rolling with OpenStack]] == Amazon Web Services Information == *[[Overview of Getting and Using an AWS IAM Account]] *[[AWS Account List and Numbers]] *[[AWS Shared Bucket Usage Graphs]] *[[AWS Best Practices]] *[[AWS S3 Lifecycle Management]] ==Kubernetes Information== *[[Computational Genomics Kubernetes Installation]] *[[Undiagnosed Disease Project Kubernetes Installation]] == Problems or technical support == If you have any problems with the GI computing environment, please send an email to '''cluster-admin@soe.ucsc.edu''' 40e2665a2daedb908ccbe7393d615b8fd769395b 269 252 2023-03-09T01:18:45Z Weiler 3 wikitext text/x-wiki Welcome to the Genomic Institute Computing Information Repository! Browse the below topics for help in the area you are curious about. 
==GI Public Computing Environment== *[[How to access the public servers]] ==GI Firewalled Computing Environment (PRISM)== *[[Access to the Firewalled Compute Servers]] ==VPN Access== *[[Requirement for users to get GI VPN access]] == NIH dbGaP Access Requirements == *[[Requirements for dbGaP Access]] == giCloud Openstack == *[[Overview of giCloud in the Genomics Institute]] *[[Quick Start Instructions to Get Rolling with OpenStack]] == Amazon Web Services Information == *[[Overview of Getting and Using an AWS IAM Account]] *[[AWS Account List and Numbers]] *[[AWS Shared Bucket Usage Graphs]] *[[AWS Best Practices]] *[[AWS S3 Lifecycle Management]] == Slurm at the Genomics Institute == *[[Overview of using Slurm]] ==Kubernetes Information== *[[Computational Genomics Kubernetes Installation]] *[[Undiagnosed Disease Project Kubernetes Installation]] == Problems or technical support == If you have any problems with the GI computing environment, please send an email to '''cluster-admin@soe.ucsc.edu''' 4b75bf2c13f246eee824b0a8c2fc77bc21b547d0 271 269 2023-03-09T01:47:42Z Weiler 3 wikitext text/x-wiki Welcome to the Genomic Institute Computing Information Repository! Browse the below topics for help in the area you are curious about. 
==GI Public Computing Environment== *[[How to access the public servers]] ==GI Firewalled Computing Environment (PRISM)== *[[Access to the Firewalled Compute Servers]] ==VPN Access== *[[Requirement for users to get GI VPN access]] == NIH dbGaP Access Requirements == *[[Requirements for dbGaP Access]] *[[Annotated Slurm Script]] == giCloud Openstack == *[[Overview of giCloud in the Genomics Institute]] *[[Quick Start Instructions to Get Rolling with OpenStack]] == Amazon Web Services Information == *[[Overview of Getting and Using an AWS IAM Account]] *[[AWS Account List and Numbers]] *[[AWS Shared Bucket Usage Graphs]] *[[AWS Best Practices]] *[[AWS S3 Lifecycle Management]] == Slurm at the Genomics Institute == *[[Overview of using Slurm]] ==Kubernetes Information== *[[Computational Genomics Kubernetes Installation]] *[[Undiagnosed Disease Project Kubernetes Installation]] == Problems or technical support == If you have any problems with the GI computing environment, please send an email to '''cluster-admin@soe.ucsc.edu''' f065dbbfaa1e4eb9a40f2443fe95aa3b9c994d1f 273 271 2023-03-09T01:49:11Z Weiler 3 wikitext text/x-wiki Welcome to the Genomic Institute Computing Information Repository! Browse the below topics for help in the area you are curious about. 
==GI Public Computing Environment== *[[How to access the public servers]] ==GI Firewalled Computing Environment (PRISM)== *[[Access to the Firewalled Compute Servers]] ==VPN Access== *[[Requirement for users to get GI VPN access]] == NIH dbGaP Access Requirements == *[[Requirements for dbGaP Access]] == giCloud Openstack == *[[Overview of giCloud in the Genomics Institute]] *[[Quick Start Instructions to Get Rolling with OpenStack]] == Amazon Web Services Information == *[[Overview of Getting and Using an AWS IAM Account]] *[[AWS Account List and Numbers]] *[[AWS Shared Bucket Usage Graphs]] *[[AWS Best Practices]] *[[AWS S3 Lifecycle Management]] == Slurm at the Genomics Institute == *[[Overview of using Slurm]] *[[Annotated Slurm Script]] ==Kubernetes Information== *[[Computational Genomics Kubernetes Installation]] *[[Undiagnosed Disease Project Kubernetes Installation]] == Problems or technical support == If you have any problems with the GI computing environment, please send an email to '''cluster-admin@soe.ucsc.edu''' af56d8300e75c1341921011b36a122df5302ef1a 281 273 2023-03-09T03:24:22Z Weiler 3 wikitext text/x-wiki Welcome to the Genomic Institute Computing Information Repository! Browse the below topics for help in the area you are curious about. 
=Main Page=
Welcome to the Genomics Institute Computing Information Repository! Browse the topics below for help in the area you are curious about.

==GI Public Computing Environment==
*[[How to access the public servers]]

==GI Firewalled Computing Environment (PRISM)==
*[[Access to the Firewalled Compute Servers]]

==VPN Access==
*[[Requirement for users to get GI VPN access]]

==NIH dbGaP Access Requirements==
*[[Requirements for dbGaP Access]]

==giCloud OpenStack==
*[[Overview of giCloud in the Genomics Institute]]
*[[Quick Start Instructions to Get Rolling with OpenStack]]

==Amazon Web Services Information==
*[[Overview of Getting and Using an AWS IAM Account]]
*[[AWS Account List and Numbers]]
*[[AWS Shared Bucket Usage Graphs]]
*[[AWS Best Practices]]
*[[AWS S3 Lifecycle Management]]

==Slurm at the Genomics Institute==
*[[Overview of using Slurm]]
*[[Annotated Slurm Script]]
*[[Job Arrays]]
*[[GPU Resources]]
*[[Quick Reference Guide]]

==Kubernetes Information==
*[[Computational Genomics Kubernetes Installation]]
*[[Undiagnosed Disease Project Kubernetes Installation]]

==Problems or technical support==
If you have any problems with the GI computing environment, please send an email to '''cluster-admin@soe.ucsc.edu'''.

=AWS S3 Lifecycle Management=
==AWS S3 Lifecycle Policy Overview==
AWS S3 buckets can be configured with lifecycle policies. These policies automatically change the storage class of objects based on when they were last modified or accessed.

AWS S3 objects are stored in the Standard storage class by default, which provides easy access but has relatively high GB/month storage costs. Other storage classes, such as Infrequent Access and Glacier, are better suited to objects that are rarely accessed: they have a much lower GB/month cost than Standard, but also incur charges for access and retrieval.

It is recommended to use the appropriate storage class for your data.
*If you have data that you do not expect to access more than once a '''month''', AWS Infrequent Access is a reasonable storage class to use.
*If you have data that you do not expect to access more than once a '''year''', AWS Glacier is a reasonable storage class to use.

==UCSC GI Automated Policy==
To reduce monthly S3 storage costs, the UCSC GI has implemented global S3 lifecycle policies that transition objects to AWS Intelligent-Tiering, which monitors S3 object access patterns and moves objects to more cost-efficient tiers accordingly.

*Objects uploaded to S3 remain in the Standard storage class for '''1''' day, at which point they are transitioned to Intelligent-Tiering.
*Old and new S3 buckets have this lifecycle policy attached automatically.

AWS Intelligent-Tiering functionality:

*Intelligent-Tiering '''does not''' change how objects are accessed. This means you do not need to execute special API commands to access objects.
*Intelligent-Tiering '''does not''' incur charges for object retrieval from different tiers.

For more details on AWS Intelligent-Tiering, see the [https://aws.amazon.com/s3/storage-classes/intelligent-tiering/ AWS Docs].

==Restoring Objects==
If an object has not been accessed for a while, you may encounter an error like this when trying to access it:

<code>An error occurred (InvalidObjectState) when calling the GetObject operation: The operation is not valid for the object's access tier</code>

This means that the object is in Glacier, either because somebody put it there, or because Intelligent-Tiering moved it there after it went unaccessed for a while. If you want to access it, you will need to restore it (and our AWS account will be billed for doing so).

To restore an object, you can use the S3 section of the AWS web console. You can also restore an object from the command line with the AWS CLI tool. To restore the object '''s3://bucket-name/path/to/object.dat''', you would issue the command:

<code>aws s3api restore-object --restore-request "{}" --bucket "bucket-name" --key "path/to/object.dat"</code>

If the object was manually put in Glacier, you would instead need <code>--restore-request "Days=7"</code>, or some other number of days. Note that you need to specify the bucket name and the key within the bucket separately, instead of using an S3 URI.

'''Restores from Glacier are not immediate, or even particularly fast.''' [http://vignette2.wikia.nocookie.net/starwars/images/0/0e/Citadel_data_vault.png/revision/latest?cb=20161220040411 Jyn Erso has to go down to the Scarif data vault and find the right data-tape], and it takes a few hours, even if your file is small.
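To check whether a restore has finished, you can query the object's metadata. As a sketch, using the same example bucket and key as above, <code>aws s3api head-object</code> reports the restore status:

<pre>
# Check restore progress for the example object above.
aws s3api head-object --bucket "bucket-name" --key "path/to/object.dat"

# While the restore is still running, the response includes a line like:
#   "Restore": "ongoing-request=\"true\""
# Once it completes, ongoing-request becomes "false" and an expiry-date
# for the temporary restored copy is shown.
</pre>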
=AWS Account List and Numbers=
This is a list of our currently available AWS accounts and their account numbers:

{| class="wikitable"
! Account !! Number
|-
| ucsc-bd2k || 862902209576
|-
| ucsc-toil-dev || 318423852362
|-
| ucsc-vg-dev || 781907127277
|-
| ucsc-platform-dev || 719818754276
|-
| comparative-genomics-dev || 162786355865
|-
| nanopore-dev || 270442831226
|-
| ucsc-cgp-production || 097093801910
|-
| platform-hca-dev || 122796619775
|-
| anvil-dev || 608666466534
|-
| gi-gateway || 652235167018
|-
| pangenomics || 422448306679
|-
| braingeneers || 443872533066
|-
| ucsctreehouse || 238605363322
|-
| ucsc-bisti-dev || 851631505710
|-
| ucsc-genome-browser || 784962239183
|-
| dockstore-dev || 635220370222
|-
| ucsc-spatial || 541180793903
|-
| platform-hca-prod || 542754589326
|-
| miga-lab || 156518225147
|-
| platform-anvil-dev || 289950828509
|-
| platform-anvil-prod || 465330168186
|-
| platform-anvil-portal || 166384485414
|-
| agc-runs || 598929688444
|}

=Requirement for users to get GI VPN access=
Before you are allowed access to our firewalled/secure area ("Prism" or "CIRM"), you have to complete three items and provide the completed certificates or forms:

'''1''': You must take and complete the NIH Public Security Refresher Course online, in a single continuous sitting: https://irtsectraining.nih.gov/FYR/00_000.aspx

The course is titled "2022 Information Security and Management Refresher".
At the end you will be able to print out or save the completion certificate, which should have your name on it.

'''2''': You need to print and sign the Genomics Institute VPN User Agreement, available for download here: [[Media:GI_VPN_Policy.pdf]]

'''3''': Please print, read, and sign the last page of the NIH Genomic Data Sharing Policy agreement, available for download here: [[Media:NIH_GDS_Policy.pdf]]. By signing the document you agree that you have read and understood the policies described therein and that you will abide by them.

When you have the three documents described above ready, please complete this form: https://app.smartsheet.com/b/form/a76dbd90ba0240ab9ea9d39b390586ce

There are two parts to this process:

1. For the user: please fill in ALL required fields '''and attach''' all three required documents described above.

2. For the Sponsor/PI: you will receive an email from Smartsheet. Please fill in all required fields and submit.

We will receive your completed request, create your account, and go over the details in a short Zoom meeting with you. If you are on campus, you will need access to the "eduroam" wireless network '''prior''' to your Zoom appointment. Other UCSC wireless networks such as "cruznet" will not work with our VPN software, so please make sure your laptop works with eduroam before your appointment. Instructions on how to get on eduroam are detailed here: https://its.ucsc.edu/wireless/eduroam.html

When using the VPN software off campus, it will usually work unless the wireless network you are on has restrictions that prevent it from functioning. Some other universities have such restrictions (notably UCSF), but most other wireless networks and home networks should work fine.

Before your appointment, please make sure you install the appropriate OpenVPN software on your laptop (running OS X, Windows, or Ubuntu):

*For Macs, please download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the latest stable version.
*For Windows, please download and install the '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. Select ''openvpn-install-x.x.x-xxxx.exe''.
*For Ubuntu, please install network-manager-openvpn by typing:

<code>sudo apt-get install network-manager-openvpn network-manager-openvpn-gnome</code>

Please do NOT worry about how to configure the software at this point; we will help you set it up at your appointment. We will correspond with you via email about when the appointment will be. The Zoom meeting can take up to 30 minutes per person, depending on whether any issues come up during the software setup. If you show up for your appointment without one (or more) of the requirements outlined above, we will have to reschedule your appointment for a time after you have completed them.

'''PLEASE NOTE:''' Because of the overhead required in setting up VPN access, please only request access if you have an immediate need to work on data that exists behind the firewall. We have had a fair number of people request access and go through the setup but then never use it. In other words, please do not request access because "one day you might need it", but because you '''do''' actually need it!

'''ALSO NOTE:''' VPN accounts expire one year from the date you first gain access. To renew for another year, your PI/sponsor will need to send us a note asking for renewal.

=Overview of Getting and Using an AWS IAM Account=
__TOC__

==Getting AWS (Amazon Web Services) Access==
The Genomics Institute has a series of AWS accounts that support different projects. If you become associated with one or more of those projects, you will often need access to the corresponding account or accounts. We manage AWS IAM access through a single "top level" AWS account that everyone gets access to; once you log in there, you can "Switch Role" into the sub-account you are working in.

To get access, ask your PI or Project Manager to email cluster-admin (cluster-admin@soe.ucsc.edu) requesting an AWS account for you, naming in that email the projects you will have access to. The cluster-admin group will contact you with your login credentials. Once you log in, you can change your password if you want to, and you will be able to set up MFA (Multi-Factor Authentication) for your account. You will be required to use MFA in order to "Switch Role" into any of the sub-accounts for the projects you are working on.

The top level account is known as "gi-gateway"; use this URL to log in to it:

https://gi-gateway.signin.aws.amazon.com/console

When you log in, you '''may''' see a couple of error messages on the AWS dashboard saying you don't have access to view certain resources. '''This is normal''', so just ignore them.
==Configuring Account Credentials==
Once you log in to gi-gateway, you will have very few permissions to do anything there. This is normal, since you will not be working in that account anyway; the gi-gateway account is just there to authenticate you to AWS.

'''Changing Your Password'''

You can change your password by clicking on your username at the top right of the browser window, just to the right of the little bell. If your username is melinda@ucsc.edu, for example:

*Click '''"melinda@ucsc.edu @ gi-gateway"''' at the top right of your browser window.
*Click the '''"My Security Credentials"''' drop-down menu option.
*Click the '''"Change Password"''' button to change your password.

Note that we have a password strength policy in place, so your password must conform to the following requirements:

*At least 10 characters long
*At least one lowercase letter
*At least one non-alphanumeric character
*At least one number

You will also need to configure MFA on your account before you will be allowed to switch roles into another account.

'''Configuring MFA'''

The most common way to configure MFA (Multi-Factor Authentication) is to use '''Google Authenticator''', a free app available for Apple and Android phones and mobile devices; simply download it from the app store to get started. Other MFA apps may also work, but we have not tested everything out there.

Once you have Google Authenticator installed, log in to the gi-gateway account using the URL above, then:

*Click '''"melinda@ucsc.edu @ gi-gateway"''' at the top right of your browser window (again, melinda@ucsc.edu is an example).
*Click the '''"My Security Credentials"''' drop-down menu option.
*Scroll down to the MFA (Multi-Factor Authentication) section of the page, and click '''"Assign MFA Device"'''.
* In the following menu select '''"Authenticator App"''', and for the device name, use your username (the email address you use to log in).
* In the following window click the '''"Show QR Code"''' link, and the MFA QR code will appear on your screen.
* Open the Google Authenticator app on your mobile device, and tap the little "+" symbol in the top right corner of the app to add an account.
* Select "Scan Barcode" in the Google Authenticator app to continue, and aim your mobile device camera at the QR code.
* The new MFA device should then be set up, and you should see a 6-digit number with a small timer to the right of it.

Type the 6-digit code it displays into your web browser when asked, then wait for the next code to appear after the timer expires, and type that into the second field. It should then inform you that you have successfully associated an MFA device with your account.

Once you have associated an MFA device with the 'gi-gateway' account, '''log out''', then log back in. It will ask for your username and password, and then for your MFA code, which you can view by opening Google Authenticator and seeing what code it is displaying at that time. The code changes every 30 seconds or so. '''You must log out and log back in using MFA in order to be able to switch roles!'''

== Switching Roles into Another AWS Account ==

Now that you have configured a password and enabled MFA, you will be allowed to "Switch Roles" into another account so that you can begin work there. The first time you switch roles into an account it will ask you a few questions; subsequently it will remember which roles you have access to, and they will appear as menu items you can click on to quickly switch roles.

First, you need the name of the account you want to switch to. Select the name from the list at [[AWS Account List and Numbers]]. Let's assume that you want to switch to the 'pangenomics' AWS account, and that the cluster-admin group has already granted you access to do so.
After logging into the 'gi-gateway' account at the URL listed here (same as above):

https://gi-gateway.signin.aws.amazon.com/console

Do the following to switch roles into the 'pangenomics' account (as an example):

* Click '''"melinda@ucsc.edu @ gi-gateway"''' at the top right of your browser window (again, melinda@ucsc.edu is an example).
* Click the '''"Switch Role"''' option in the drop-down menu.
* In the following menu it will ask you about the role you will be assuming. In our example we will use the following:
** Account* = pangenomics
** Role* = developer
** Display Name = [leave blank, or use a short phrase]
** Color = [choose a color for this role]
* Then click the "Switch Role" button.

If all went well you should be dropped into the 'pangenomics' account, and you should be identified in the top right corner of the page as '''"developer @ pangenomics"''', indicating your role and the account you are active in. You can then work as normal in that account. If you have not yet been given access to that role, you will receive an error message and will not be allowed to switch roles.

'''NOTE:''' When you switch roles, it may put you into a region that you don't expect. Always verify the region you are in by looking at the top right of the web page, where it is displayed. Most of our resources exist in "Oregon" (us-west-2), but some items appear in other regions on a per-case basis.

If you wish to switch context back to the 'gi-gateway' account in order to manage something, or to switch to another role in another account, simply:

* Click '''"developer @ pangenomics"''' in the top right corner of the window.
* Select '''"Back to melinda@ucsc.edu"'''.

You will then be sent back to the 'gi-gateway' context, where you can add another role to switch into, manage your credentials, and switch roles again.
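For the curious: the console's "Switch Role" button is backed by the AWS STS AssumeRole API, and the same switch can be sketched from the command line. This is an illustration only, not a required step; the account number 123456789012 is a placeholder (look up the real one on [[AWS Account List and Numbers]]), the session name is arbitrary, and the MFA username and token code are examples:

```shell
# One-off role assumption via STS, the same operation the console's
# "Switch Role" performs. All identifiers below are placeholders.
aws sts assume-role \
    --role-arn arn:aws:iam::123456789012:role/developer \
    --role-session-name melinda-pangenomics \
    --serial-number arn:aws:iam::652235167018:mfa/melinda@ucsc.edu \
    --token-code 123456
# The JSON response contains temporary AccessKeyId, SecretAccessKey,
# and SessionToken values that can be exported for subsequent commands.
```

In practice you will not run this by hand; the ~/.aws/config profiles described in the next section make the 'aws' command do it for you automatically.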
== API Access and Secret Keys ==

If you require programmatic access to AWS, you are probably familiar with the AWS concepts of Access Keys and Secret Keys, which scripts can use to authenticate to AWS and call its APIs without going through the web console. In the past, access keys and secret keys could be used with no further authentication. This is a security risk: such keys must be carefully guarded, because anyone who gets your keys can rack up charges on your AWS account without your knowledge!

Under the "Assume Role" mechanism we now use, Access Keys and Secret Keys can still be created, but '''only while logged into the gi-gateway account'''. Do not try to create keys while you have "Switched Roles" into another account. Keys you create in the top level 'gi-gateway' account will work for you in any sub-account you have access to switch roles into. You will, however, need to do a little more configuration before your keys will work from a UNIX command line.

To set up your access and secret keys for the first time (again, logged into the 'gi-gateway' account only): log into the gi-gateway web interface, click on your username in the top right corner of the browser window, then click "My Security Credentials". On that screen you will see an "Access Keys" section with one key listed. Delete that key (using the "Delete" button on the right side of the key), then create a new key using the "Create Access Key" button. It will show you your access and secret key only ONCE, so make sure to copy and paste it somewhere.

Note that we recommend awscli version 1.16.187 or later, as earlier versions have documented issues with profiles and MFA-related actions.
You can determine your version of awscli with:

 $ aws --version

=== Entering Base Credentials ===

If you plan on using keys for API access, you will minimally need to configure the "aws" utility and then tweak the config a bit for our setup. To start, run "aws configure". It should look something like this (enter the access and secret keys that you created in the previous step):

 $ aws configure
 AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
 AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
 Default region name [None]: us-west-2
 Default output format [None]:

This creates two files, both of which are needed to access AWS via the 'aws' command:

 ~/.aws/config
 ~/.aws/credentials

'''~/.aws/credentials'''

This file contains your access key and secret key, and should not need to be modified after running 'aws configure'. The same keys can be used to access any roles in any accounts you have access to.

'''~/.aws/config'''

This file contains account information you will need to tweak. There are a few ways you could set it up.

=== Adjusting Configuration for Toil or a Single Role ===

If you usually use a single role for a single project, or if you need to use Toil with a particular role, configure it like this, so that the role is automatically assumed for every operation by default:

 [default]
 region = us-west-2
 source_profile = default
 role_arn = arn:aws:iam::422448306679:role/developer
 mfa_serial = arn:aws:iam::652235167018:mfa/melinda@ucsc.edu
 duration_seconds = 43200

The "role_arn" line contains the role and account number you are accessing. You can see a list of live account numbers at [[AWS Account List and Numbers]]. Find the account number you need and enter it on the role_arn line, along with the role name; you will get the role name from the cluster-admin group when you get access. The 'mfa_serial' line contains the identifier for your MFA device.
It will always look like '''"arn:aws:iam::652235167018:mfa/[your_iam_username]"'''. The account number there will always be "652235167018", because that is the account number of the top level "gi-gateway" account.

The "duration_seconds" parameter sets your session token's lifetime to 43200 seconds (12 hours), the maximum you can request (you can specify less). That means you will only have to authenticate with MFA once every 12 hours, rather than every time you run a command.

Once that is configured, you should be able to use the aws command without any profile specified, and have it automatically assume a role to grant you access:

 $ aws s3 ls

It will ask you for your MFA code and then run the command. Once you enter the MFA code, the token it creates will be valid for 12 hours if you specified "duration_seconds = 43200" (if you omitted that line, the default session duration is one hour), so you can run other 'aws' CLI commands without re-authenticating with MFA for the duration of the session. After the session expires, you will need to authenticate via MFA again.

=== Adjusting Configuration for Multiple Roles ===

If you have multiple roles that you use equally often, and you don't need to use Toil, you can configure it something like this, with multiple profiles:

 [default]
 region = us-west-2
 
 [profile pangenomics-developer]
 source_profile = default
 role_arn = arn:aws:iam::422448306679:role/developer
 mfa_serial = arn:aws:iam::652235167018:mfa/melinda@ucsc.edu
 duration_seconds = 43200

Once that is configured, you should be able to reference the profile you just created when using the aws command, like so:

 $ aws s3 ls --profile pangenomics-developer

== Tag Your Resources ==

When you start using AWS resources (instances, networks, etc.), it is very important that you "tag" your resources with the "Owner" tag (note the capital "O").
"Owner" is the key, and the value assigned to it will be your IAM username (i.e. your email address). So, for example, if I spin up an instance, I would tag it during or after creation with something like: Owner = bob@ucsc.edu If you do not tag your instances, '''they will automatically be terminated within 10 minutes.''' Tag your instances especially, but tag every resource you create! This allows us to perform accounting tasks much more easily and allows the Program Managers to know which resources are controlled by who. 609de9a14a96b0b114f4b8cb8ed582fff6302402 260 259 2022-07-22T23:53:41Z Anovak 4 Fix link wikitext text/x-wiki __TOC__ == Getting AWS (Amazon Web Services) Access == The Genomics Institute has a series of AWS Accounts that all support different projects. Often if you become associated with one or more of those projects, you will need access to that account or accounts. The way we are managing AWS IAM Account Access is that we have one AWS account that is the 'top level' account that everyone gets access to, and then, once you log in there, you can "Switch Role" into another sub-account that you are running things in. To get access, you will need your PI or Project Manager to email cluster-admin (cluster-admin@soe.ucsc.edu) asking for an AWS account for you, and also in that email to name the projects you will have access to. The cluster-admin group will contact you with your credentials to login. Once you login, you can change your password if you want to and also you will be able to set up MFA (Multi Factor Authentication) for your account. You will be required to use MFA in order to "Switch Role" into any of the sub-accounts for the projects you are working on. The login URL to use when logging in to the top level account is listed below. 
The top level account is known as "gi-gateway": https://gi-gateway.signin.aws.amazon.com/console When you login, you '''may''' see a couple error messages on the AWS dashboard saying you don't have access to view certain resources - '''this is normal''', so just ignore the error messages. == Configuring Account Credentials == Once you login to the gi-gateway, you will have very few permissions to do anything there - which is normal, since you will not be working in that account anyway. The gi-gateway account is just there to authenticate you to AWS. '''Changing Your Password''' You can change your password by clicking on your username on the top right of the web browser window, just to the right of the little bell. If your username is melinda@ucsc.edu, for example: * Click '''"melinda@ucsc.edu @ gi-gateway"''' on the top right of your browser window. * Click the '''"My Security Credentials"''' drop-down menu option. * Click the '''"Change Password"''' button to change your password. Note that we have a password strength policy in place, so your password must conform to the following requirements: * Your password must be at least 10 characters long * Your password must contain at least one lowercase letter * Your password must contain at least one non-alphanumeric character * Your password must contain at least one number You will also need to configure MFA on your account before you will be allowed to switch roles into another account. '''Configuring MFA''' To configure MFA (Multi Factor Authentication), the most common way to do it is to use '''Google Authenticator''', which is an app available for Apple and Android based cell phones and mobile devices. The app is free, simply download it from the app store to your cell phone or tablet to get started. Other MFA apps may also work but we have not tested everything out there. 
Once you have Google Authenticator installed, log into the gi-gateway account using the above URL, then: * Click '''"melinda@ucsc.edu @ gi-gateway"''' on the top right of your browser window (again, melinda@ucsc.edu is an example). * Click the '''"My Security Credentials"''' drop-down menu option. * Scroll down to the MFA (Multi-Factor Authentication) section of the page, and click '''"Assign MFA Device"'''. * In the following menu select '''"Virtual MFA Device"'''. * In the following window click the '''"Show QR Code"''' link, and the MFA QR barcode will appear on your screen. * Open the Google Authenticator app on your mobile device, and click the little "+" symbol in the top right corner of the app to add an account. * You will then need to select "Scan Barcode" in the Google Authenticaor app to continue, and aim your mobile device camera at the QR barcode. * The new account MFA device should then be set up and you should see a 6 digit number with a small timer to the right of it. You must type in one 6 digit code that it displays into your web browser when asked, then wait for the next code to appear after the timer expires, and type that into the second field. It should then inform you that you have successfully associated an MFA device with your account. Once you have associated an MFA device with the 'gi-gateway' Account, '''log out''', then log back in. It will ask for your username and password, and then ask for your MFA code, which you can view by opening Google Authenticator and seeing what code it is displaying at that time. The code changes every 30 seconds or so. '''You must log out first and log back in using MFA in order to be able to switch roles!!!''' == Switching Roles into Another AWS Account == Now that you have configured a password and enabled MFA, you will be allowed to "Switch Roles" into another account such that you can begin work there. 
The first time you switch roles into an account it will ask you a few questions, but subsequently it will remember which roles you have access to and they will become a menu item you can click on to quickly switch roles. First, you need the name of the account you want to switch to. Select the name from the list at [[AWS Account List and Numbers]]. Let's assume that you want to switch to the 'pangenomics' AWS account, and you have been already granted access to do so by the cluster-admin group. After logging into the 'gi-gateway' account at the URL listed here (same as above): https://gi-gateway.signin.aws.amazon.com/console Do the following to switch roles into the 'pangenomics' account (as an example): * Click '''"melinda@ucsc.edu @ gi-gateway"''' on the top right of your browser window (again, melinda@ucsc.edu is an example). * Click the '''"Switch Role"''' option in the drop-down menu. * In the following menu it will ask you about the role you will be assuming. In our example we will use the following: -Account* = pangenomics -Role* = developer -Display Name = [leave blank, or use a short phrase] -Color = [choose a color for this role] * Then click the "Switch Role" button. If all went well you should be dumped into the 'pangenomics' account, and you should be identified in the top right hand corner of the page as '''"developer @ pangenomics"''', indicating your role and the account you are active in. You can then work as normal in that account. If you have not yet been given access to that role, you will receive an error message and not be allowed to switch roles. '''NOTE:''' When you switch roles, it may dump you into a region that you don't expect it to. Always verify the region you are in by looking at the top right of the web page - it will display your region there. Most of our stuff exists in "Oregon" (us-west-2), but some items appear in other regions on a per-case basis. 
If you wish to switch context back to the 'gi-gateway' account in order to manage something, or to switch to another role in another account, simple: * Click '''"developer @ pangenomics"''' in the top right corner of the window. * Select '''"Back to melinda@ucsc.edu"''' You will then be sent back to the 'gi-gateway' context, and you can add another role to switch into, manage your credentials and further switch roles. == API Access and Secret Keys == If you require programmatic access to AWS, you will very likely be familiar with the AWS concept of Access Keys and Secret Keys, which can be used by scripts to authenticate yourself to AWS and use the APIs there without using the web console to authenticate. In the past, access keys and secret keys could be used by users with no further authentication. This introduces a security risk, as the management of those keys must be carefully guarded - if anyone gets your keys, they can rack up charges on your AWS account without your knowledge! Using the "Assume Role" mechanism we are now using, Access Keys and Secret Keys can still be created by users '''while logged into the gi-gateway account only'''. Do not try to create keys while you have "Switched Roles" into another account. Keys you create in the top level 'gi-gateway' account will work for you in any sub-account you have access to switch roles to. You will need to do a little more configuration for your keys to work from a UNIX command line however. To set up your access and secret keys for the first time (again, logged into the 'gi-gateway' account only), follow these instructions. Once you log into the gi-gateway web interface, click on your username in the top right corner of the browser window, then click "My Security Credentials". In that screen you will see an "Access Keys" section, and you will have one key listed. Delete that key (using the "Delete" button on the right side of the key), then create a new key using the "Create Access Key" button. 
It will show you your access and secret key ONCE, so make sure to copy and paste it somewhere. It should be noted that we recommend awscli version 1.16.187 or later, as earlier versions have documented issues with using profiles and MFA related actions. You can determine your version of awscli by doing: aws --version === Entering Base Credentials === Generically, if you plan on using keys for API Access, minimally you will need to configure the "aws" utility and then tweak the config a bit for our setup. To start, run "aws configure". It should look something like this (put in your access and secret keys that you created in the previous step): $ aws configure AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY Default region name [None]: us-west-2 Default output format [None]: Most folks do that to start. It creates two files: ~/.aws/config ~/.aws/credentials Those two files are important to access AWS via the 'aws' command. '''~/.aws/credentials''' This file contains your access key and secret key, and should not need to be modified after running 'aws configure'. Your same keys can be used to access any roles in any accounts you have access to. '''~/.aws/config''' This file contains some account information you will need to tweak. There are a few ways you could set it up. === Adjusting Configuration for Toil or a Single Role === If you usually use a single role for a single project, or if you need to use Toil with a particular role, you should configure it like this, so that that role is automatically assumed for every operation by default: [default] region = us-west-2 source_profile = default role_arn = arn:aws:iam::422448306679:role/developer mfa_serial = arn:aws:iam::652235167018:mfa/melinda@ucsc.edu duration_seconds = 43200 The "role_arn" line contains the role and account number you are accessing. 
You can see a list of live account numbers here: [[AWS Account List and Numbers]] Find the account number you need and enter it on the role_arn line, as well as the role name. You will get the role name from the cluster-admin group when you get access. The 'mfa_serial' line contains the identifier for your MFA device. It will always look like '''"arn:aws:iam::652235167018:mfa/[your_iam_username]"'''. The account number listed there will always be "652235167018" because that is the account number of the top level "gi-gateway" account. The "duration_seconds" parameter says that your session token will be 43200 seconds long (12 hours). That means you will only have to authenticate with MFA once every 12 hours. 12 hours is the maximum you can request, although you can specify less than that. This means it won't ask you for MFA every time you run a command for the next 12 hours. Once that is configured, you should be able to use the aws command without any profile specified, and have it automatically assume a role to grant you access: $ aws s3 ls It will ask you for your MFA code and then run the command. Once you enter the MFA code, the token it creates will be valid for 12 hours if you specified "duration_seconds = 43200", or if you omitted that line, the default session duration is one hour, so you can run other 'aws' cli commands without the need to re-authenticate with MFA for the duration of the session. After the session expires, you will need to authenticate via MFA again. 
=== Adjusting Configuration for Multiple Roles === If you have multiple roles that you use equally often, and you don't need to use Toil, you can configure it something like this, with multiple profiles: [default] region = us-west-2 [profile pangenomics-developer] source_profile = default role_arn = arn:aws:iam::422448306679:role/developer mfa_serial = arn:aws:iam::652235167018:mfa/melinda@ucsc.edu duration_seconds = 43200 Once that is configured, you should be able to reference the profile you just created when using the aws command, like so: $ aws s3 ls --profile pangenomics-developer ==Tag Your Resources== When you start using AWS resources (instances, networks, etc), it is very important that you "tag" your resources with the "Owner" tag (note the capital "O"). "Owner" is the key, and the value assigned to it will be your IAM username (i.e. your email address). So, for example, if I spin up an instance, I would tag it during or after creation with something like: Owner = bob@ucsc.edu If you do not tag your instances, '''they will automatically be terminated within 10 minutes.''' Tag your instances especially, but tag every resource you create! This allows us to perform accounting tasks much more easily and allows the Program Managers to know which resources are controlled by who. b84b21296356bfda6b5e104ae5ecff53fed3b5c6 265 260 2023-01-18T19:53:12Z Weiler 3 wikitext text/x-wiki __TOC__ == Getting AWS (Amazon Web Services) Access == The Genomics Institute has a series of AWS Accounts that all support different projects. Often if you become associated with one or more of those projects, you will need access to that account or accounts. The way we are managing AWS IAM Account Access is that we have one AWS account that is the 'top level' account that everyone gets access to, and then, once you log in there, you can "Switch Role" into another sub-account that you are running things in. 
To get access, you will need your PI or Project Manager to email cluster-admin (cluster-admin@soe.ucsc.edu) asking for an AWS account for you, and also in that email to name the projects you will have access to. The cluster-admin group will contact you with your credentials to login. Once you login, you can change your password if you want to and also you will be able to set up MFA (Multi Factor Authentication) for your account. You will be required to use MFA in order to "Switch Role" into any of the sub-accounts for the projects you are working on. The login URL to use when logging in to the top level account is listed below. The top level account is known as "gi-gateway": https://gi-gateway.signin.aws.amazon.com/console When you login, you '''may''' see a couple error messages on the AWS dashboard saying you don't have access to view certain resources - '''this is normal''', so just ignore the error messages. == Configuring Account Credentials == Once you login to the gi-gateway, you will have very few permissions to do anything there - which is normal, since you will not be working in that account anyway. The gi-gateway account is just there to authenticate you to AWS. '''Changing Your Password''' You can change your password by clicking on your username on the top right of the web browser window, just to the right of the little bell. If your username is melinda@ucsc.edu, for example: * Click '''"melinda@ucsc.edu @ gi-gateway"''' on the top right of your browser window. * Click the '''"My Security Credentials"''' drop-down menu option. * Click the '''"Change Password"''' button to change your password. 
Note that we have a password strength policy in place, so your password must conform to the following requirements: * Your password must be at least 10 characters long * Your password must contain at least one lowercase letter * Your password must contain at least one non-alphanumeric character * Your password must contain at least one number You will also need to configure MFA on your account before you will be allowed to switch roles into another account. '''Configuring MFA''' To configure MFA (Multi Factor Authentication), the most common way to do it is to use '''Google Authenticator''', which is an app available for Apple and Android based cell phones and mobile devices. The app is free, simply download it from the app store to your cell phone or tablet to get started. Other MFA apps may also work but we have not tested everything out there. Once you have Google Authenticator installed, log into the gi-gateway account using the above URL, then: * Click '''"melinda@ucsc.edu @ gi-gateway"''' on the top right of your browser window (again, melinda@ucsc.edu is an example). * Click the '''"Security Credentials"''' drop-down menu option. * Scroll down to the MFA (Multi-Factor Authentication) section of the page, and click '''"Assign MFA Device"'''. * In the following menu select '''"Authenticator App", and for the device name, use your username (which is your email address used to login)'''. * In the following window click the '''"Show QR Code"''' link, and the MFA QR barcode will appear on your screen. * Open the Google Authenticator app on your mobile device, and click the little "+" symbol in the top right corner of the app to add an account. * You will then need to select "Scan Barcode" in the Google Authenticaor app to continue, and aim your mobile device camera at the QR barcode. * The new account MFA device should then be set up and you should see a 6 digit number with a small timer to the right of it. 
You must type in one 6 digit code that it displays into your web browser when asked, then wait for the next code to appear after the timer expires, and type that into the second field. It should then inform you that you have successfully associated an MFA device with your account. Once you have associated an MFA device with the 'gi-gateway' Account, '''log out''', then log back in. It will ask for your username and password, and then ask for your MFA code, which you can view by opening Google Authenticator and seeing what code it is displaying at that time. The code changes every 30 seconds or so. '''You must log out first and log back in using MFA in order to be able to switch roles!!!''' == Switching Roles into Another AWS Account == Now that you have configured a password and enabled MFA, you will be allowed to "Switch Roles" into another account such that you can begin work there. The first time you switch roles into an account it will ask you a few questions, but subsequently it will remember which roles you have access to and they will become a menu item you can click on to quickly switch roles. First, you need the name of the account you want to switch to. Select the name from the list at [[AWS Account List and Numbers]]. Let's assume that you want to switch to the 'pangenomics' AWS account, and you have been already granted access to do so by the cluster-admin group. After logging into the 'gi-gateway' account at the URL listed here (same as above): https://gi-gateway.signin.aws.amazon.com/console Do the following to switch roles into the 'pangenomics' account (as an example): * Click '''"melinda@ucsc.edu @ gi-gateway"''' on the top right of your browser window (again, melinda@ucsc.edu is an example). * Click the '''"Switch Role"''' option in the drop-down menu. * In the following menu it will ask you about the role you will be assuming. 
In our example we will use the following: -Account* = pangenomics -Role* = developer -Display Name = [leave blank, or use a short phrase] -Color = [choose a color for this role] * Then click the "Switch Role" button. If all went well you should be dumped into the 'pangenomics' account, and you should be identified in the top right hand corner of the page as '''"developer @ pangenomics"''', indicating your role and the account you are active in. You can then work as normal in that account. If you have not yet been given access to that role, you will receive an error message and not be allowed to switch roles. '''NOTE:''' When you switch roles, it may dump you into a region that you don't expect it to. Always verify the region you are in by looking at the top right of the web page - it will display your region there. Most of our stuff exists in "Oregon" (us-west-2), but some items appear in other regions on a per-case basis. If you wish to switch context back to the 'gi-gateway' account in order to manage something, or to switch to another role in another account, simple: * Click '''"developer @ pangenomics"''' in the top right corner of the window. * Select '''"Back to melinda@ucsc.edu"''' You will then be sent back to the 'gi-gateway' context, and you can add another role to switch into, manage your credentials and further switch roles. == API Access and Secret Keys == If you require programmatic access to AWS, you will very likely be familiar with the AWS concept of Access Keys and Secret Keys, which can be used by scripts to authenticate yourself to AWS and use the APIs there without using the web console to authenticate. In the past, access keys and secret keys could be used by users with no further authentication. This introduces a security risk, as the management of those keys must be carefully guarded - if anyone gets your keys, they can rack up charges on your AWS account without your knowledge! 
Using the "Assume Role" mechanism we are now using, Access Keys and Secret Keys can still be created by users '''while logged into the gi-gateway account only'''. Do not try to create keys while you have "Switched Roles" into another account. Keys you create in the top level 'gi-gateway' account will work for you in any sub-account you have access to switch roles to. You will need to do a little more configuration for your keys to work from a UNIX command line however. To set up your access and secret keys for the first time (again, logged into the 'gi-gateway' account only), follow these instructions. Once you log into the gi-gateway web interface, click on your username in the top right corner of the browser window, then click "My Security Credentials". In that screen you will see an "Access Keys" section, and you will have one key listed. Delete that key (using the "Delete" button on the right side of the key), then create a new key using the "Create Access Key" button. It will show you your access and secret key ONCE, so make sure to copy and paste it somewhere. It should be noted that we recommend awscli version 1.16.187 or later, as earlier versions have documented issues with using profiles and MFA related actions. You can determine your version of awscli by doing: aws --version === Entering Base Credentials === Generically, if you plan on using keys for API Access, minimally you will need to configure the "aws" utility and then tweak the config a bit for our setup. To start, run "aws configure". It should look something like this (put in your access and secret keys that you created in the previous step): $ aws configure AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY Default region name [None]: us-west-2 Default output format [None]: Most folks do that to start. It creates two files: ~/.aws/config ~/.aws/credentials Those two files are important to access AWS via the 'aws' command. 
'''~/.aws/credentials'''

This file contains your access key and secret key, and should not need to be modified after running 'aws configure'. The same keys can be used to access any roles in any accounts you have access to.

'''~/.aws/config'''

This file contains account information you will need to tweak. There are a few ways to set it up.

=== Adjusting Configuration for Toil or a Single Role ===

If you usually use a single role for a single project, or if you need to use Toil with a particular role, configure it like this, so that the role is automatically assumed for every operation by default:

 [default]
 region = us-west-2
 source_profile = default
 role_arn = arn:aws:iam::422448306679:role/developer
 mfa_serial = arn:aws:iam::652235167018:mfa/melinda@ucsc.edu
 duration_seconds = 43200

The "role_arn" line contains the role and account number you are accessing. You can see a list of live account numbers here: [[AWS Account List and Numbers]] Find the account number you need and enter it on the role_arn line, along with the role name. You will get the role name from the cluster-admin group when you get access.

The 'mfa_serial' line contains the identifier for your MFA device. It will always look like '''"arn:aws:iam::652235167018:mfa/[your_iam_username]"'''. The account number there is always "652235167018", because that is the account number of the top-level "gi-gateway" account.

The "duration_seconds" parameter sets your session token lifetime to 43200 seconds (12 hours), the maximum you can request (you can specify less). That means you will only have to authenticate with MFA once every 12 hours, rather than on every command.
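To double-check which account and role a role_arn points at, you can split the ARN on its delimiters. A small sketch using the example ARN above:

```shell
# Sketch: pick the account number and role name out of a role_arn.
# ARN layout: arn:aws:iam::<account-number>:role/<role-name>
role_arn="arn:aws:iam::422448306679:role/developer"
account=$(printf '%s' "$role_arn" | cut -d: -f5)
role=$(printf '%s' "$role_arn" | cut -d/ -f2)
echo "account=$account role=$role"   # account=422448306679 role=developer
```

Compare the account number against the [[AWS Account List and Numbers]] page to confirm you are editing the right profile.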
Once that is configured, you should be able to use the aws command without any profile specified, and have it automatically assume a role to grant you access:

 $ aws s3 ls

It will ask you for your MFA code and then run the command. Once you enter the MFA code, the token it creates will be valid for 12 hours if you specified "duration_seconds = 43200" (if you omitted that line, the default session duration is one hour), so you can run other 'aws' CLI commands without re-authenticating with MFA for the duration of the session. After the session expires, you will need to authenticate via MFA again.

=== Adjusting Configuration for Multiple Roles ===

If you have multiple roles that you use equally often, and you don't need to use Toil, you can configure multiple profiles, something like this:

 [default]
 region = us-west-2

 [profile pangenomics-developer]
 source_profile = default
 role_arn = arn:aws:iam::422448306679:role/developer
 mfa_serial = arn:aws:iam::652235167018:mfa/melinda@ucsc.edu
 duration_seconds = 43200

Once that is configured, reference the profile you just created when using the aws command, like so:

 $ aws s3 ls --profile pangenomics-developer

== Tag Your Resources ==

When you start using AWS resources (instances, networks, etc.), it is very important that you "tag" your resources with the "Owner" tag (note the capital "O"). "Owner" is the key, and the value assigned to it is your IAM username (i.e. your email address). So, for example, if I spin up an instance, I would tag it during or after creation with something like:

 Owner = bob@ucsc.edu

If you do not tag your instances, '''they will automatically be terminated within 10 minutes.''' Tag your instances especially, but tag every resource you create! This makes accounting tasks much easier and lets the Program Managers know which resources are controlled by whom.
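Tagging can also be done from the CLI rather than the console, via the standard "aws ec2 create-tags" command. A hedged sketch - the instance id below is a made-up placeholder, and the command is composed without being executed so the sketch is safe to run anywhere:

```shell
# Sketch: apply the required Owner tag from the command line.
# The instance id is a hypothetical placeholder; substitute a resource
# you actually own.
owner="bob@ucsc.edu"
instance_id="i-0123456789abcdef0"

# The real call would be:
#   aws ec2 create-tags --resources "$instance_id" --tags "Key=Owner,Value=$owner"
# Compose it here without executing:
cmd="aws ec2 create-tags --resources $instance_id --tags Key=Owner,Value=$owner"
echo "$cmd"
```

Tagging during creation (in the launch wizard or via --tag-specifications) saves you from the 10-minute termination window entirely.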
How to access the public servers
2023-03-01 Weiler
== How to Gain Access to the Public Genomics Institute Compute Servers ==

If you need access to the Genomics Institute compute servers, please complete this request form: https://app.smartsheet.com/b/form/a76dbd90ba0240ab9ea9d39b390586ce. There are two parts to this process.

1. For the user: please fill in ALL required fields and submit.

2. For the Sponsor/PI: you will receive an email from Smartsheet. Please fill in all required fields and submit.

We will receive your completed request, create your account, and go over the details in a short Zoom meeting with you.
== Account and Storage Cost ==

Costs for having an active UNIX account and for storage (per TB) are listed in this document, under "Genomics Project Support", specifically "Genomics IT Systems User Support per user" and "Genomics Data Storage per TB": https://planning.ucsc.edu/budget/rates-and-assessments/recharge-rates/docs/2021-22-approved-recharge-rates.pdf

== Account Expiration ==

Your UNIX account will have an expiration date associated with it after creation, as requested by your sponsor. Please take note of this expiration date when your account is created. You will receive notice by email when your account is about to expire. To renew, simply ask the PI that sponsored you (named in the notice) to email '''cluster-admin@soe.ucsc.edu''' requesting that your account be renewed for another year, or any other amount of time. If your account expires, the account will be suspended and you will no longer be able to log in or view any data you may have on our systems. Any automated scripts (owned by you) that run via cron or other mechanisms will cease to function.

== Server Types and Management ==

You can log into our public compute servers via SSH:

* '''courtyard.gi.ucsc.edu''': 1TB RAM, 64 cores, 672GB local scratch space, CentOS 7.9
* '''plaza.gi.ucsc.edu''': 1TB RAM, 64 cores, 672GB local scratch space, CentOS 7.9
* '''park.gi.ucsc.edu''': 256GB RAM, 32 cores, 5TB local scratch space, Ubuntu 22.04.1

These servers are managed by the Genomics Institute Cluster Admin group. If you need software installed on them, please make your request by emailing cluster-admin@soe.ucsc.edu.

== Storage ==

These servers mount two types of storage: home directories and group storage directories. Your home directory is located at "/public/home/username" and has a 30GB quota. The group storage directories are created per PI, and each group directory has a 15TB quota.
For example, if David Haussler is the PI that you report to directly, the directory would exist as /public/groups/hausslerlab. Request access to that group directory and you will then be able to write to it. Each group directory is shared by the lab it belongs to, so be mindful of everyone's data usage and share the 15TB available per group accordingly.

On the compute servers you can check your group's current quota usage with the '/usr/bin/viewquota' command. You can only check the quota of a group you are part of (you would be a member of the UNIX group of the same name). To check the quota usage of /public/groups/hausslerlab, for example, you would do:

 $ viewquota hausslerlab
 Project quota on /export (/dev/mapper/export)
 Project ID   Used  Soft  Hard  Warn/Grace
 ----------   -----------------------------
 hausslerlab  1.8T  15T   16T   00 [------]

== Actually Doing Work and Computing ==

When doing research, running jobs and the like, please be careful of your resource consumption on the server you are on. Don't run so many threads or processes at once that you overrun the available RAM or disk IO. If you are not sure of your potential RAM, CPU or disk impact, start small with one or two processes and work your way up from there. Also, before running your jobs, check what else is already happening on the server with the 'top' command, to see who and what else is running and what resources are already being consumed. If, after starting a process, you realize that the server slows down considerably or becomes unusable, kill your processes and re-evaluate what you need to make things work. These servers are shared resources - be a good neighbor!
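The "check before you launch" advice above can be scripted with standard Linux tools ('top' itself is interactive, so this sketch only reads the load average and core count):

```shell
# Sketch: see how busy the machine already is before starting work.
cores=$(nproc)                         # total CPU cores
load=$(cut -d' ' -f1 /proc/loadavg)    # 1-minute load average
echo "cores=$cores 1min-load=$load"
# If the load average is already near the core count, the machine is
# busy - run 'top' interactively to see who is using it.
```

A 1-minute load well below the core count generally means there is headroom; a load at or above it means your jobs will be competing with others for CPU.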
== Serving Files to the Public via the Web ==

If you want to set up a web page on courtyard, or serve files over HTTP from there, do this:

 mkdir /public/home/''your_username''/public_html
 chmod 755 /public/home/''your_username''/public_html

Put data in the public_html directory. The URL will be: http://public.gi.ucsc.edu/''~username''/

== /scratch Space on the Servers ==

Each server will generally have a local /scratch filesystem that you can use to store temporary files. '''BE ADVISED''' that /scratch is not backed up, and the data there could disappear in the event of a disk failure or anything else. Do not store important data there. If data is important, it should be moved somewhere else very soon after creation.

Overview of using Slurm
2023-03-09 Weiler
When using Slurm, you will need to log into the Slurm head node (currently phoenix-01.gi.ucsc.edu, a one-node cluster at the moment). Once you have ssh'd in there, you can execute Slurm batch or interactive commands.

== Submit a Slurm Batch Job ==

To submit a Slurm batch job, you will need to create a directory that you have read and write access to on all the nodes (often a shared space). Let's say I have a batch named "experiment-1". I would create that directory in my group's area:

 % mkdir /public/groups/clusteradmin/weiler/slurm-jobs/experiment-1
 % cd /public/groups/clusteradmin/weiler/slurm-jobs/experiment-1

Then you will need to create your job submission batch file. It will look something like this. My file is called 'slurm-test.sh':

 % vim slurm-test.sh

Then populate the file as necessary:

 #!/bin/bash
 # Job name:
 #SBATCH --job-name=weiler_test
 #
 # Partition - This is the queue it goes in:
 #SBATCH --partition=batch
 #
 # Where to send email (optional)
 #SBATCH --mail-user=weiler@ucsc.edu
 #
 # Number of nodes you need per job:
 #SBATCH --nodes=1
 #
 # Memory needed for the jobs. Try very hard to make this accurate.
 # DEFAULT = 4gb
 #SBATCH --mem=4gb
 #
 # Number of tasks (one for each GPU desired for use case) (example):
 #SBATCH --ntasks=1
 #
 # Processors per task:
 # At least eight times the number of GPUs needed for nVidia RTX A5500
 #SBATCH --cpus-per-task=1
 #
 # Number of GPUs, in the format "gpu:[1-4]", or "gpu:K80:[1-4]" with the type included (optional)
 #SBATCH --gres=gpu:1
 #
 # Standard output and error log
 #SBATCH --output=serial_test_%j.log
 #
 # Wall clock limit in hrs:min:sec:
 #SBATCH --time=00:00:30
 #
 ## Command(s) to run (example):
 pwd; hostname; date
 module load python
 echo "Running test script on a single CPU core"
 python /public/groups/clusteradmin/weiler/slurm-jobs/experiment-1/mytest.py
 date

Keep the "SBATCH" lines commented; the scheduler reads them anyway. If you don't need a particular option, just leave it out of the file.

To submit the batch job:

 % sbatch slurm-test.sh
 Submitted batch job 7

The job(s) will then be scheduled. You can see the state of the queue as such:

 % squeue
  JOBID PARTITION     NAME   USER ST  TIME NODES NODELIST(REASON)
      7     batch weiler_t weiler  R  0:07     1 phoenix-01

The job will write any STDOUT or STDERR in the directory you launched the job from. Other than that, it will do whatever the job does, even if there is no STDOUT.

== Launching Several Jobs at Once ==

You can launch many jobs at once using the $SLURM_ARRAY_TASK_ID variable.
Add something like the following to your batch submission file:

 #SBATCH --array=0-31
 #SBATCH --output=array_job_%A_task_%a.out
 #SBATCH --error=array_job_%A_task_%a.err
 ## Command(s) to run:
 echo "I am task $SLURM_ARRAY_TASK_ID"

You could have a bunch of jobs, one script for each, in a directory:

 /mydir/big_run/job-0.sh
 /mydir/big_run/job-1.sh
 /mydir/big_run/job-2.sh

and define them as such in your batch file:

 #SBATCH --array=0-2
 #SBATCH --output=array_job_%A_task_%a.out
 #SBATCH --error=array_job_%A_task_%a.err
 ## Command(s) to run:
 /mydir/big_run/job-$SLURM_ARRAY_TASK_ID.sh

== CGROUPS and Resource Management ==

Our installation of Slurm uses Linux CGROUPS, which put a hard resource cap on jobs. If you declare that your job needs 4GB of RAM and it uses 5GB, it will fail with an OOM exception. Likewise with CPU or GPU resources: if your job ends up using more than you specify, it will fail. This keeps the nodes from crashing under runaway jobs that use more resources than you expect. So... TEST YOUR JOBS! Find out how much a single job needs in resources before you launch 100 of them.
Add something like the following to your batch submission file: #SBATCH --array=0-31 #SBATCH --output=array_job_%A_task_%a.out #SBATCH --error=array_job_%A_task_%a.err ## Command(s) to run: echo "I am task $SLURM_ARRAY_TASK_ID" == CGROUPS and Resource Management == Our installation of Slurm will utilize Linux CGROUPS, which puts a hard resource cap on jobs. If you define that your job will need 4GB of RAM, and it uses 5GB, it will fail with an OOM exception. Likewise with CPU or GPU resources; if your job ends up using more than you specify, it will fail. Or the "--time" batch file option, your job will fail if it takes longer than what you specify there. This is to keep the nodes from crashing from runaway jobs that use more resources than you think they will. So... TEST YOUR JOBS! Find our how much in resources a single job needs before you launch 100 of them. 3d122dc3c9902a438c91edc3aff193bbcaa7aea7 291 290 2023-04-09T21:17:27Z Weiler 3 wikitext text/x-wiki When using Slurm, you will need to log into the Slurm head node (currently phoenix-01.gi.ucsc.edu, a one node cluster at the moment). Once you have ssh'd in there, you can execute slurm batch or interactive commands. == Submit a Slurm Batch Job == In order to submit a Slurm batch job list, you will need to create a directory that you will have read and write access to on all the nodes (which will often be a shared space). Let's say I have a batch named "experiment-1". I would create that directory in my groups area: % mkdir /public/groups/clusteradmin/weiler/slurm-jobs/experiment-1 % cd /public/groups/clusteradmin/weiler/slurm-jobs/experiment-1 Then you will need to create your job submission batch file. It will look something like this. 
My file is called 'slurm-test.sh': % vim slurm-test.sh Then populate the file as necessary: #!/bin/bash # Job name: #SBATCH --job-name=weiler_test # # Partition - This is the queue it goes in: #SBATCH --partition=batch # # Where to send email (optional) #SBATCH --mail-user=weiler@ucsc.edu # # Number of nodes you need per job: #SBATCH --nodes=1 # # Memory needed for the jobs. Try very hard to make this accurate. DEFAULT = 4gb #SBATCH --mem=4gb # # Number of tasks (one for each GPU desired for use case) (example): #SBATCH --ntasks=1 # # Processors per task: # At least eight times the number of GPUs needed for nVidia RTX A5500 #SBATCH --cpus-per-task=1 # # Number of GPUs, this can be in the format of "gpu:[1-4]", or "gpu:K80:[1-4] with the type included (optional) #SBATCH --gres=gpu:1 # # Standard output and error log #SBATCH --output=serial_test_%j.log # # Wall clock limit in hrs:min:sec: #SBATCH --time=00:00:30 # ## Command(s) to run (example): pwd; hostname; date module load python echo "Running test script on a single CPU core" python /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1/mytest.py date Keep the "SBATCH" lines commented, the scheduler will read them anyway. If you don't need a particular option, just don't include it in the file. To submit the batch job: % sbatch slurm-test.sh Submitted batch job 7 The job(s) will then be scheduled. You can see the state of the queue as such: % squeue JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 7 batch weiler_t weiler R 0:07 1 phoenix-01 The job will output any STDOUT or STDERR in the directory you launched the job from. Other than that, it will do whatever the job does, even if there is no STDOUT. == Launching Several Jobs at Once == You can launch many jobs at once using the $SLURM_ARRAY_TASK_ID variable. 
Add something like the following to your batch submission file: #SBATCH --array=0-31 #SBATCH --output=array_job_%A_task_%a.out #SBATCH --error=array_job_%A_task_%a.err ## Command(s) to run: echo "I am task $SLURM_ARRAY_TASK_ID" == CGROUPS and Resource Management == Our installation of Slurm will utilize Linux CGROUPS, which puts a hard resource cap on jobs. If you define that your job will need 4GB of RAM, and it uses 5GB, it will fail with an OOM exception. Likewise with CPU or GPU resources; if your job ends up using more than you specify, it will fail. Or the "--time" batch file option, your job will fail if it takes longer than what you specify there. This is to keep the nodes from crashing from runaway jobs that use more resources than you think they will. So... TEST YOUR JOBS! Find our how much in resources a single job needs before you launch 100 of them. 0e55e0a53717d938517b2a14214eae40646e883f 292 291 2023-04-09T21:19:31Z Weiler 3 wikitext text/x-wiki When using Slurm, you will need to log into the Slurm head node (currently phoenix-01.gi.ucsc.edu, a one node cluster at the moment). Once you have ssh'd in there, you can execute slurm batch or interactive commands. == Submit a Slurm Batch Job == In order to submit a Slurm batch job list, you will need to create a directory that you will have read and write access to on all the nodes (which will often be a shared space). Let's say I have a batch named "experiment-1". I would create that directory in my groups area: % mkdir -p /public/groups/clusteradmin/weiler/slurm-jobs/experiment-1 % cd /public/groups/clusteradmin/weiler/slurm-jobs/experiment-1 Then you will need to create your job submission batch file. It will look something like this. 
My file is called 'slurm-test.sh': % vim slurm-test.sh Then populate the file as necessary: #!/bin/bash # Job name: #SBATCH --job-name=weiler_test # # Partition - This is the queue it goes in: #SBATCH --partition=batch # # Where to send email (optional) #SBATCH --mail-user=weiler@ucsc.edu # # Number of nodes you need per job: #SBATCH --nodes=1 # # Memory needed for the jobs. Try very hard to make this accurate. DEFAULT = 4gb #SBATCH --mem=4gb # # Number of tasks (one for each GPU desired for use case) (example): #SBATCH --ntasks=1 # # Processors per task: # At least eight times the number of GPUs needed for nVidia RTX A5500 #SBATCH --cpus-per-task=1 # # Number of GPUs, this can be in the format of "gpu:[1-4]", or "gpu:K80:[1-4] with the type included (optional) #SBATCH --gres=gpu:1 # # Standard output and error log #SBATCH --output=serial_test_%j.log # # Wall clock limit in hrs:min:sec: #SBATCH --time=00:00:30 # ## Command(s) to run (example): pwd; hostname; date module load python echo "Running test script on a single CPU core" python /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1/mytest.py date Keep the "SBATCH" lines commented, the scheduler will read them anyway. If you don't need a particular option, just don't include it in the file. To submit the batch job: % sbatch slurm-test.sh Submitted batch job 7 The job(s) will then be scheduled. You can see the state of the queue as such: % squeue JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 7 batch weiler_t weiler R 0:07 1 phoenix-01 The job will output any STDOUT or STDERR in the directory you launched the job from. Other than that, it will do whatever the job does, even if there is no STDOUT. == Launching Several Jobs at Once == You can launch many jobs at once using the $SLURM_ARRAY_TASK_ID variable. 
Add something like the following to your batch submission file: #SBATCH --array=0-31 #SBATCH --output=array_job_%A_task_%a.out #SBATCH --error=array_job_%A_task_%a.err ## Command(s) to run: echo "I am task $SLURM_ARRAY_TASK_ID" == CGROUPS and Resource Management == Our installation of Slurm will utilize Linux CGROUPS, which puts a hard resource cap on jobs. If you define that your job will need 4GB of RAM, and it uses 5GB, it will fail with an OOM exception. Likewise with CPU or GPU resources; if your job ends up using more than you specify, it will fail. Or the "--time" batch file option, your job will fail if it takes longer than what you specify there. This is to keep the nodes from crashing from runaway jobs that use more resources than you think they will. So... TEST YOUR JOBS! Find our how much in resources a single job needs before you launch 100 of them. 4a694445139e7f54efee12a1f03e7707e7ae7d7a 293 292 2023-04-09T21:19:50Z Weiler 3 wikitext text/x-wiki When using Slurm, you will need to log into the Slurm head node (currently phoenix-01.gi.ucsc.edu, a one node cluster at the moment). Once you have ssh'd in there, you can execute slurm batch or interactive commands. == Submit a Slurm Batch Job == In order to submit a Slurm batch job list, you will need to create a directory that you will have read and write access to on all the nodes (which will often be a shared space). Let's say I have a batch named "experiment-1". I would create that directory in my groups area: % mkdir -p /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1 % cd /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1 Then you will need to create your job submission batch file. It will look something like this. 
My file is called 'slurm-test.sh': % vim slurm-test.sh Then populate the file as necessary: #!/bin/bash # Job name: #SBATCH --job-name=weiler_test # # Partition - This is the queue it goes in: #SBATCH --partition=batch # # Where to send email (optional) #SBATCH --mail-user=weiler@ucsc.edu # # Number of nodes you need per job: #SBATCH --nodes=1 # # Memory needed for the jobs. Try very hard to make this accurate. DEFAULT = 4gb #SBATCH --mem=4gb # # Number of tasks (one for each GPU desired for use case) (example): #SBATCH --ntasks=1 # # Processors per task: # At least eight times the number of GPUs needed for nVidia RTX A5500 #SBATCH --cpus-per-task=1 # # Number of GPUs, this can be in the format of "gpu:[1-4]", or "gpu:K80:[1-4] with the type included (optional) #SBATCH --gres=gpu:1 # # Standard output and error log #SBATCH --output=serial_test_%j.log # # Wall clock limit in hrs:min:sec: #SBATCH --time=00:00:30 # ## Command(s) to run (example): pwd; hostname; date module load python echo "Running test script on a single CPU core" python /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1/mytest.py date Keep the "SBATCH" lines commented, the scheduler will read them anyway. If you don't need a particular option, just don't include it in the file. To submit the batch job: % sbatch slurm-test.sh Submitted batch job 7 The job(s) will then be scheduled. You can see the state of the queue as such: % squeue JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 7 batch weiler_t weiler R 0:07 1 phoenix-01 The job will output any STDOUT or STDERR in the directory you launched the job from. Other than that, it will do whatever the job does, even if there is no STDOUT. == Launching Several Jobs at Once == You can launch many jobs at once using the $SLURM_ARRAY_TASK_ID variable. 
Add something like the following to your batch submission file: #SBATCH --array=0-31 #SBATCH --output=array_job_%A_task_%a.out #SBATCH --error=array_job_%A_task_%a.err ## Command(s) to run: echo "I am task $SLURM_ARRAY_TASK_ID" == CGROUPS and Resource Management == Our installation of Slurm will utilize Linux CGROUPS, which puts a hard resource cap on jobs. If you define that your job will need 4GB of RAM, and it uses 5GB, it will fail with an OOM exception. Likewise with CPU or GPU resources; if your job ends up using more than you specify, it will fail. Or the "--time" batch file option, your job will fail if it takes longer than what you specify there. This is to keep the nodes from crashing from runaway jobs that use more resources than you think they will. So... TEST YOUR JOBS! Find our how much in resources a single job needs before you launch 100 of them. 513be43c2adcbfc334bab414814b3af82aad09db 294 293 2023-04-09T21:21:18Z Weiler 3 wikitext text/x-wiki When using Slurm, you will need to log into the Slurm head node (currently phoenix-01.gi.ucsc.edu, a one node cluster at the moment). Once you have ssh'd in there, you can execute slurm batch or interactive commands. == Submit a Slurm Batch Job == In order to submit a Slurm batch job list, you will need to create a directory that you will have read and write access to on all the nodes (which will often be a shared space). Let's say I have a batch named "experiment-1". I would create that directory in my groups area: % mkdir -p /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1 % cd /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1 Then you will need to create your job submission batch file. It will look something like this. 
My file is called 'slurm-test.sh': % vim slurm-test.sh Then populate the file as necessary: #!/bin/bash # Job name: #SBATCH --job-name=weiler_test # # Partition - This is the queue it goes in: #SBATCH --partition=main # # Where to send email (optional) #SBATCH --mail-user=weiler@ucsc.edu # # Number of nodes you need per job: #SBATCH --nodes=1 # # Memory needed for the jobs. Try very hard to make this accurate. DEFAULT = 4gb #SBATCH --mem=4gb # # Number of tasks (one for each GPU desired for use case) (example): #SBATCH --ntasks=1 # # Processors per task: # At least eight times the number of GPUs needed for nVidia RTX A5500 #SBATCH --cpus-per-task=1 # # Number of GPUs, this can be in the format of "gpu:[1-4]", or "gpu:K80:[1-4] with the type included (optional) #SBATCH --gres=gpu:1 # # Standard output and error log #SBATCH --output=serial_test_%j.log # # Wall clock limit in hrs:min:sec: #SBATCH --time=00:00:30 # ## Command(s) to run (example): pwd; hostname; date module load python echo "Running test script on a single CPU core" python /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1/mytest.py date Keep the "SBATCH" lines commented, the scheduler will read them anyway. If you don't need a particular option, just don't include it in the file. To submit the batch job: % sbatch slurm-test.sh Submitted batch job 7 The job(s) will then be scheduled. You can see the state of the queue as such: % squeue JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 7 batch weiler_t weiler R 0:07 1 phoenix-01 The job will output any STDOUT or STDERR in the directory you launched the job from. Other than that, it will do whatever the job does, even if there is no STDOUT. == Launching Several Jobs at Once == You can launch many jobs at once using the $SLURM_ARRAY_TASK_ID variable. 
Add something like the following to your batch submission file: #SBATCH --array=0-31 #SBATCH --output=array_job_%A_task_%a.out #SBATCH --error=array_job_%A_task_%a.err ## Command(s) to run: echo "I am task $SLURM_ARRAY_TASK_ID" == CGROUPS and Resource Management == Our installation of Slurm will utilize Linux CGROUPS, which puts a hard resource cap on jobs. If you define that your job will need 4GB of RAM, and it uses 5GB, it will fail with an OOM exception. Likewise with CPU or GPU resources; if your job ends up using more than you specify, it will fail. Or the "--time" batch file option, your job will fail if it takes longer than what you specify there. This is to keep the nodes from crashing from runaway jobs that use more resources than you think they will. So... TEST YOUR JOBS! Find our how much in resources a single job needs before you launch 100 of them. 9e1e3e8033d01ef959d2c15bde614164b5599bff 295 294 2023-04-09T21:24:36Z Weiler 3 wikitext text/x-wiki When using Slurm, you will need to log into the Slurm head node (currently phoenix-01.gi.ucsc.edu, a one node cluster at the moment). Once you have ssh'd in there, you can execute slurm batch or interactive commands. == Submit a Slurm Batch Job == In order to submit a Slurm batch job list, you will need to create a directory that you will have read and write access to on all the nodes (which will often be a shared space). Let's say I have a batch named "experiment-1". I would create that directory in my groups area: % mkdir -p /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1 % cd /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1 Then you will need to create your job submission batch file. It will look something like this. 
My file is called 'slurm-test.sh': % vim slurm-test.sh Then populate the file as necessary: #!/bin/bash # Job name: #SBATCH --job-name=weiler_test # # Partition - This is the queue it goes in: #SBATCH --partition=main # # Where to send email (optional) #SBATCH --mail-user=weiler@ucsc.edu # # Number of nodes you need per job: #SBATCH --nodes=1 # # Memory needed for the jobs. Try very hard to make this accurate. DEFAULT = 4gb #SBATCH --mem=4gb # # Number of tasks (one for each GPU desired for use case) (example): #SBATCH --ntasks=1 # # Processors per task: # At least eight times the number of GPUs needed for nVidia RTX A5500 #SBATCH --cpus-per-task=1 # # Number of GPUs, this can be in the format of "gpu:[1-4]", or "gpu:K80:[1-4] with the type included (optional) #SBATCH --gres=gpu:1 # # Standard output and error log #SBATCH --output=serial_test_%j.log # # Wall clock limit in hrs:min:sec: #SBATCH --time=00:00:30 # ## Command(s) to run (example): pwd; hostname; date echo "Running test script on a single CPU core" echo "Test done!" date Keep the "SBATCH" lines commented, the scheduler will read them anyway. If you don't need a particular option, just don't include it in the file. To submit the batch job: % sbatch slurm-test.sh Submitted batch job 7 The job(s) will then be scheduled. You can see the state of the queue as such: % squeue JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 7 batch weiler_t weiler R 0:07 1 phoenix-01 The job will output any STDOUT or STDERR in the directory you launched the job from. Other than that, it will do whatever the job does, even if there is no STDOUT. == Launching Several Jobs at Once == You can launch many jobs at once using the $SLURM_ARRAY_TASK_ID variable. 
Add something like the following to your batch submission file: #SBATCH --array=0-31 #SBATCH --output=array_job_%A_task_%a.out #SBATCH --error=array_job_%A_task_%a.err ## Command(s) to run: echo "I am task $SLURM_ARRAY_TASK_ID" == CGROUPS and Resource Management == Our installation of Slurm will utilize Linux CGROUPS, which puts a hard resource cap on jobs. If you define that your job will need 4GB of RAM, and it uses 5GB, it will fail with an OOM exception. Likewise with CPU or GPU resources; if your job ends up using more than you specify, it will fail. Or the "--time" batch file option, your job will fail if it takes longer than what you specify there. This is to keep the nodes from crashing from runaway jobs that use more resources than you think they will. So... TEST YOUR JOBS! Find our how much in resources a single job needs before you launch 100 of them. 0d924abde20ecde52cfdc15d304810b8ce1080cd 296 295 2023-04-09T21:26:42Z Weiler 3 wikitext text/x-wiki When using Slurm, you will need to log into the Slurm head node (currently phoenix-01.gi.ucsc.edu, a one node cluster at the moment). Once you have ssh'd in there, you can execute slurm batch or interactive commands. == Submit a Slurm Batch Job == In order to submit a Slurm batch job list, you will need to create a directory that you will have read and write access to on all the nodes (which will often be a shared space). Let's say I have a batch named "experiment-1". I would create that directory in my groups area: % mkdir -p /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1 % cd /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1 Then you will need to create your job submission batch file. It will look something like this. 
My file is called 'slurm-test.sh': % vim slurm-test.sh Then populate the file as necessary: #!/bin/bash # Job name: #SBATCH --job-name=weiler_test # # Partition - This is the queue it goes in: #SBATCH --partition=main # # Where to send email (optional) #SBATCH --mail-user=weiler@ucsc.edu # # Number of nodes you need per job: #SBATCH --nodes=1 # # Memory needed for the jobs. Try very hard to make this accurate. DEFAULT = 4gb #SBATCH --mem=4gb # # Number of tasks (one for each GPU desired for use case) (example): #SBATCH --ntasks=1 # # Processors per task: # At least eight times the number of GPUs needed for nVidia RTX A5500 #SBATCH --cpus-per-task=1 # # Number of GPUs, this can be in the format of "gpu:[1-4]", or "gpu:K80:[1-4] with the type included (optional) #SBATCH --gres=gpu:1 # # Standard output and error log #SBATCH --output=serial_test_%j.log # # Wall clock limit in hrs:min:sec: #SBATCH --time=00:00:30 # ## Command(s) to run (example): pwd; hostname; date echo "Running test script on a single CPU core" sleep 5 echo "Test done!" date Keep the "SBATCH" lines commented, the scheduler will read them anyway. If you don't need a particular option, just don't include it in the file. To submit the batch job: % sbatch slurm-test.sh Submitted batch job 7 The job(s) will then be scheduled. You can see the state of the queue as such: % squeue JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 7 batch weiler_t weiler R 0:07 1 phoenix-01 The job will output any STDOUT or STDERR in the directory you launched the job from. Other than that, it will do whatever the job does, even if there is no STDOUT. == Launching Several Jobs at Once == You can launch many jobs at once using the $SLURM_ARRAY_TASK_ID variable. 
Add something like the following to your batch submission file: #SBATCH --array=0-31 #SBATCH --output=array_job_%A_task_%a.out #SBATCH --error=array_job_%A_task_%a.err ## Command(s) to run: echo "I am task $SLURM_ARRAY_TASK_ID" == CGROUPS and Resource Management == Our installation of Slurm will utilize Linux CGROUPS, which puts a hard resource cap on jobs. If you define that your job will need 4GB of RAM, and it uses 5GB, it will fail with an OOM exception. Likewise with CPU or GPU resources; if your job ends up using more than you specify, it will fail. Or the "--time" batch file option, your job will fail if it takes longer than what you specify there. This is to keep the nodes from crashing from runaway jobs that use more resources than you think they will. So... TEST YOUR JOBS! Find our how much in resources a single job needs before you launch 100 of them. acde99e28da56d222e2c3d2965cff9c352e8b71d 297 296 2023-05-02T03:12:06Z Weiler 3 wikitext text/x-wiki When using Slurm, you will need to log into the Slurm head node (currently phoenix-01.gi.ucsc.edu, a one node cluster at the moment). Once you have ssh'd in there, you can execute slurm batch or interactive commands. == Submit a Slurm Batch Job == In order to submit a Slurm batch job list, you will need to create a directory that you will have read and write access to on all the nodes (which will often be a shared space). Let's say I have a batch named "experiment-1". I would create that directory in my groups area: % mkdir -p /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1 % cd /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1 Then you will need to create your job submission batch file. It will look something like this. 
My file is called 'slurm-test.sh': % vim slurm-test.sh Then populate the file as necessary: #!/bin/bash # Job name: #SBATCH --job-name=weiler_test # # Partition - This is the queue it goes in: #SBATCH --partition=main # # Where to send email (optional) #SBATCH --mail-user=weiler@ucsc.edu # # Number of nodes you need per job: #SBATCH --nodes=1 # # Memory needed for the jobs. Try very hard to make this accurate. DEFAULT = 4gb #SBATCH --mem=4gb # # Number of tasks (one for each GPU desired for use case) (example): #SBATCH --ntasks=1 # # Processors per task: # At least eight times the number of GPUs needed for nVidia RTX A5500 #SBATCH --cpus-per-task=1 # # Number of GPUs, this can be in the format of "gpu:[1-4]", or "gpu:A5500:[1-4] with the type included (optional) #SBATCH --gres=gpu:1 # # Standard output and error log #SBATCH --output=serial_test_%j.log # # Wall clock limit in hrs:min:sec: #SBATCH --time=00:00:30 # ## Command(s) to run (example): pwd; hostname; date echo "Running test script on a single CPU core" sleep 5 echo "Test done!" date Keep the "SBATCH" lines commented, the scheduler will read them anyway. If you don't need a particular option, just don't include it in the file. To submit the batch job: % sbatch slurm-test.sh Submitted batch job 7 The job(s) will then be scheduled. You can see the state of the queue as such: % squeue JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 7 batch weiler_t weiler R 0:07 1 phoenix-01 The job will output any STDOUT or STDERR in the directory you launched the job from. Other than that, it will do whatever the job does, even if there is no STDOUT. == Launching Several Jobs at Once == You can launch many jobs at once using the $SLURM_ARRAY_TASK_ID variable. 
Add something like the following to your batch submission file: #SBATCH --array=0-31 #SBATCH --output=array_job_%A_task_%a.out #SBATCH --error=array_job_%A_task_%a.err ## Command(s) to run: echo "I am task $SLURM_ARRAY_TASK_ID" == CGROUPS and Resource Management == Our installation of Slurm will utilize Linux CGROUPS, which puts a hard resource cap on jobs. If you define that your job will need 4GB of RAM, and it uses 5GB, it will fail with an OOM exception. Likewise with CPU or GPU resources; if your job ends up using more than you specify, it will fail. Or the "--time" batch file option, your job will fail if it takes longer than what you specify there. This is to keep the nodes from crashing from runaway jobs that use more resources than you think they will. So... TEST YOUR JOBS! Find our how much in resources a single job needs before you launch 100 of them. 52112e5a3fe89f96de5e280d77a844b3f0766d9e 300 297 2023-05-03T20:19:22Z Weiler 3 wikitext text/x-wiki When using Slurm, you will need to log into the Slurm head node (currently phoenix-01.gi.ucsc.edu, a one node cluster at the moment). Once you have ssh'd in there, you can execute slurm batch or interactive commands. == Submit a Slurm Batch Job == In order to submit a Slurm batch job list, you will need to create a directory that you will have read and write access to on all the nodes (which will often be a shared space). Let's say I have a batch named "experiment-1". I would create that directory in my groups area: % mkdir -p /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1 % cd /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1 Then you will need to create your job submission batch file. It will look something like this. 
My file is called 'slurm-test.sh': % vim slurm-test.sh Then populate the file as necessary: #!/bin/bash # Job name: #SBATCH --job-name=weiler_test # # Partition - This is the queue it goes in: #SBATCH --partition=main # # Where to send email (optional) #SBATCH --mail-user=weiler@ucsc.edu # # Number of nodes you need per job: #SBATCH --nodes=1 # # Memory needed for the jobs. Try very hard to make this accurate. DEFAULT = 4gb #SBATCH --mem=4gb # # Number of tasks (one for each GPU desired for use case) (example): #SBATCH --ntasks=1 # # Processors per task: # At least eight times the number of GPUs needed for nVidia RTX A5500 #SBATCH --cpus-per-task=1 # # Number of GPUs, this can be in the format of "--gres=gpu:[1-8]", or "--gres=gpu:A5500:[1-8] with the type included (optional) #SBATCH --gres=gpu:1 # # Standard output and error log #SBATCH --output=serial_test_%j.log # # Wall clock limit in hrs:min:sec: #SBATCH --time=00:00:30 # ## Command(s) to run (example): pwd; hostname; date echo "Running test script on a single CPU core" sleep 5 echo "Test done!" date Keep the "SBATCH" lines commented, the scheduler will read them anyway. If you don't need a particular option, just don't include it in the file. To submit the batch job: % sbatch slurm-test.sh Submitted batch job 7 The job(s) will then be scheduled. You can see the state of the queue as such: % squeue JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 7 batch weiler_t weiler R 0:07 1 phoenix-01 The job will output any STDOUT or STDERR in the directory you launched the job from. Other than that, it will do whatever the job does, even if there is no STDOUT. == Launching Several Jobs at Once == You can launch many jobs at once using the $SLURM_ARRAY_TASK_ID variable. 
Add something like the following to your batch submission file: #SBATCH --array=0-31 #SBATCH --output=array_job_%A_task_%a.out #SBATCH --error=array_job_%A_task_%a.err ## Command(s) to run: echo "I am task $SLURM_ARRAY_TASK_ID" == CGROUPS and Resource Management == Our installation of Slurm will utilize Linux CGROUPS, which puts a hard resource cap on jobs. If you define that your job will need 4GB of RAM, and it uses 5GB, it will fail with an OOM exception. Likewise with CPU or GPU resources; if your job ends up using more than you specify, it will fail. Or the "--time" batch file option, your job will fail if it takes longer than what you specify there. This is to keep the nodes from crashing from runaway jobs that use more resources than you think they will. So... TEST YOUR JOBS! Find our how much in resources a single job needs before you launch 100 of them. 996ddc55eb9bc76474e3a86d5ba0058e92487887 301 300 2023-05-03T20:19:52Z Weiler 3 wikitext text/x-wiki When using Slurm, you will need to log into the Slurm head node (currently phoenix-01.gi.ucsc.edu, a one node cluster at the moment). Once you have ssh'd in there, you can execute slurm batch or interactive commands. == Submit a Slurm Batch Job == In order to submit a Slurm batch job list, you will need to create a directory that you will have read and write access to on all the nodes (which will often be a shared space). Let's say I have a batch named "experiment-1". I would create that directory in my groups area: % mkdir -p /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1 % cd /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1 Then you will need to create your job submission batch file. It will look something like this. 
My file is called 'slurm-test.sh': % vim slurm-test.sh Then populate the file as necessary: #!/bin/bash # Job name: #SBATCH --job-name=weiler_test # # Partition - This is the queue it goes in: #SBATCH --partition=main # # Where to send email (optional) #SBATCH --mail-user=weiler@ucsc.edu # # Number of nodes you need per job: #SBATCH --nodes=1 # # Memory needed for the job. Try very hard to make this accurate. DEFAULT = 4gb #SBATCH --mem=4gb # # Number of tasks (one for each GPU desired for the use case) (example): #SBATCH --ntasks=1 # # Processors per task: # At least eight times the number of GPUs needed for the NVIDIA RTX A5500 #SBATCH --cpus-per-task=1 # # Number of GPUs; this can be in the format of "--gres=gpu:[1-8]", or "--gres=gpu:A5500:[1-8]" with the type included (optional) #SBATCH --gres=gpu:1 # # Standard output and error log #SBATCH --output=serial_test_%j.log # # Wall clock limit in hrs:min:sec: #SBATCH --time=00:00:30 # ## Command(s) to run (example): pwd; hostname; date echo "Running test script on a single CPU core" sleep 5 echo "Test done!" date Keep the "SBATCH" lines commented; the scheduler will read them anyway. If you don't need a particular option, just leave it out of the file. To submit the batch job: % sbatch slurm-test.sh Submitted batch job 7 The job(s) will then be scheduled. You can see the state of the queue like so: % squeue JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 7 batch weiler_t weiler R 0:07 1 phoenix-01 The job will write any STDOUT or STDERR to the directory you launched it from. Other than that, it will do whatever the job does, even if there is no STDOUT. == Launching Several Jobs at Once == You can launch many jobs at once using the $SLURM_ARRAY_TASK_ID variable.
Add something like the following to your batch submission file: #SBATCH --array=0-31 #SBATCH --output=array_job_%A_task_%a.out #SBATCH --error=array_job_%A_task_%a.err ## Command(s) to run: echo "I am task $SLURM_ARRAY_TASK_ID" == CGROUPS and Resource Management == Our installation of Slurm uses Linux cgroups, which put a hard resource cap on jobs. If you declare that your job needs 4GB of RAM and it uses 5GB, it will fail with an out-of-memory (OOM) error. The same goes for CPU and GPU resources; if your job ends up using more than you specify, it will fail. Likewise for the "--time" option: your job will fail if it runs longer than the limit you specify there. This is to keep the nodes from crashing under runaway jobs that use more resources than you think they will. So... TEST YOUR JOBS! Find out how many resources a single job needs before you launch 100 of them. 30a9b39b917061102cb017fd938e3f532b374dc8 Annotated Slurm Script 0 33 272 2023-03-09T01:48:13Z Weiler 3 Created page with "[[Category:Scheduler]] This is a walk-through for a basic SLURM scheduler job script for a common case of a multi-threaded analysys. If the program you run is single-threaded..." wikitext text/x-wiki [[Category:Scheduler]] This is a walk-through for a basic SLURM scheduler job script for a common case of a multi-threaded analysis. If the program you run is single-threaded (can use only one CPU core), then use only the '--ntasks=1' line for the CPU request instead of all three listed lines. Annotations are marked with bullet points. You can click on the link below to download the raw job script file without the annotation. Values in brackets are placeholders. You need to replace them with your own values. E.g. change '<job name>' to something like 'blast_proj22'. We will write additional documentation on more complex job layouts for MPI jobs and other situations where a simple number of processor cores is not sufficient.
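As a quick reference before the annotated walk-through, an assembled version of the raw job script might look like the sketch below. This is only an illustration built from the same example values used in the annotations: <EMAIL> is a placeholder, and the BLAST module and command mirror the walk-through's example rather than software guaranteed to be installed on your cluster.

```shell
#!/bin/bash
#SBATCH --job-name=blast_proj22          # job name shown in the queue
#SBATCH --mail-user=<EMAIL>              # where to send batch system emails
#SBATCH --mail-type=FAIL,END             # email on failure and at job end
#SBATCH --output=my_job-%j.out           # stdout/stderr log (%j = job id)
#SBATCH --nodes=1                        # single node for non-MPI jobs
#SBATCH --ntasks=1                       # single task for non-MPI jobs
#SBATCH --cpus-per-task=4                # must match the program's thread count
#SBATCH --mem=4gb                        # total memory limit for the job
#SBATCH --time=72:00:00                  # wall clock limit

date;hostname;pwd                        # record host/time/dir for troubleshooting

# Example analysis commands (illustrative):
module load ncbi_blast
blastn -db nt -query input.fa -outfmt 6 -out results.tsv -num_threads 4
date
```

Submit it with 'sbatch <scriptname>' exactly as described in the batch job section above.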
{|cellspacing=30 |-style="vertical-align:top;" |style="width: 50%"| ;Set the shell to use <pre> #!/bin/bash </pre> ;Common arguments * Name the job to make it easier to see in the job queue <pre> #SBATCH --job-name=<JOBNAME> </pre> ;Email :Your email address to use for all batch system communications <pre> #SBATCH --mail-user=<EMAIL> #SBATCH --mail-user=<EMAIL-ONE>,<EMAIL-TWO> </pre> ;GPUs :How many GPUs your job will require <pre> #SBATCH --gres=gpu:1 </pre> ;What emails to send :NONE - no emails :ALL - all emails :END,FAIL - only email if the job fails and email the summary at the end of the job <pre> #SBATCH --mail-type=FAIL,END </pre> ;Standard Output and Error log files :Use file patterns :: %j - job id :: %A_%a - Array job id (A) and task id (a) :: You can also use --error for a separate stderr log <pre> #SBATCH --output=<my_job-%j.out> </pre> ;Number of nodes to use. For all non-MPI jobs this number will be equal to '1' <pre> #SBATCH --nodes=1 </pre> ;Number of tasks. For all non-MPI jobs this number will be equal to '1' <pre> #SBATCH --ntasks=1 </pre> ;Number of CPU cores to use. This number must match the thread count you pass to the program you run. <pre> #SBATCH --cpus-per-task=4 </pre> || ;Total memory limit for the job. Default is 2 gigabytes, but units can be specified with mb or gb for Megabytes or Gigabytes. <pre> #SBATCH --mem=4gb </pre> ;Job run time in [DAYS-]HOURS:MINUTES:SECONDS :The [DAYS-] part is optional; use it when it is convenient <pre> #SBATCH --time=72:00:00 </pre> ;Optional: :A group to use if you belong to multiple groups. Otherwise, do not use.
<pre> #SBATCH --account=<GROUP> </pre> :A job array, which will create many jobs (called array tasks) that differ only in the '<code>$SLURM_ARRAY_TASK_ID</code>' variable, similar to [[Torque_Job_Arrays]] on HiPerGator 1 <pre> #SBATCH --array=<BEGIN-END> </pre> ;Example of five tasks :<nowiki>#</nowiki>SBATCH --array=1-5 ---- ;Recommended convenient shell code to put into your job script * Add host, time, and directory name for later troubleshooting <pre> date;hostname;pwd </pre> Below is the shell script part - the commands you will run to analyze your data. The following is an example. * Load the software you need <pre> module load ncbi_blast </pre> * Run the program <pre> blastn -db nt -query input.fa -outfmt 6 -out results.tsv -num_threads 4 date </pre> |} 33034a4756195ba0275d9831cea8bc23a7e7475b Job Arrays 0 34 282 2023-03-09T03:28:42Z Weiler 3 Created page with "== Job Array Support == == Overview == Job arrays offer a mechanism for submitting and managing collections of similar jobs quickly and easily; job arrays with millions of t..." wikitext text/x-wiki
417163ddf6bcc49c2f9e39549ed13b65f86ffbd9 283 282 2023-03-09T03:29:05Z Weiler 3 wikitext text/x-wiki == Overview == Job arrays offer a mechanism for submitting and managing collections of similar jobs quickly and easily; job arrays with millions of tasks can be submitted in milliseconds (subject to configured size limits). All jobs must have the same initial options (e.g. size, time limit, etc.); however, it is possible to change some of these options after the job has begun execution using the scontrol command, specifying the JobID of the array or an individual ArrayJobID. $ scontrol update job=101 ... $ scontrol update job=101_1 ... Job arrays are only supported for batch jobs, and the array index values are specified using the --array or -a option of the sbatch command. The option argument can be specific array index values, a range of index values, and an optional step size as shown in the examples below. Note that the minimum index value is zero and the maximum value is a Slurm configuration parameter (MaxArraySize minus one). Each job which is part of a job array will have the environment variable SLURM_ARRAY_TASK_ID set to its array index value. # Submit a job array with index values between 0 and 31 $ sbatch --array=0-31 -N1 tmp # Submit a job array with index values of 1, 3, 5 and 7 $ sbatch --array=1,3,5,7 -N1 tmp # Submit a job array with index values between 1 and 7 # with a step size of 2 (i.e. 1, 3, 5 and 7) $ sbatch --array=1-7:2 -N1 tmp A maximum number of simultaneously running tasks from the job array may be specified using a "%" separator. For example, "--array=0-15%4" will limit the number of simultaneously running tasks from this job array to 4. == Job ID and Environment Variables == Job arrays will have several additional environment variables set.
SLURM_ARRAY_JOB_ID will be set to the first job ID of the array. SLURM_ARRAY_TASK_ID will be set to the job array index value. SLURM_ARRAY_TASK_COUNT will be set to the number of tasks in the job array. SLURM_ARRAY_TASK_MAX will be set to the highest job array index value. SLURM_ARRAY_TASK_MIN will be set to the lowest job array index value. For example, a job submission of this sort sbatch --array=1-3 -N1 tmp will generate a job array containing three jobs. If the sbatch command responds Submitted batch job 36 then the environment variables will be set as follows: SLURM_JOB_ID=36 SLURM_ARRAY_JOB_ID=36 SLURM_ARRAY_TASK_ID=1 SLURM_ARRAY_TASK_COUNT=3 SLURM_ARRAY_TASK_MAX=3 SLURM_ARRAY_TASK_MIN=1 SLURM_JOB_ID=37 SLURM_ARRAY_JOB_ID=36 SLURM_ARRAY_TASK_ID=2 SLURM_ARRAY_TASK_COUNT=3 SLURM_ARRAY_TASK_MAX=3 SLURM_ARRAY_TASK_MIN=1 SLURM_JOB_ID=38 SLURM_ARRAY_JOB_ID=36 SLURM_ARRAY_TASK_ID=3 SLURM_ARRAY_TASK_COUNT=3 SLURM_ARRAY_TASK_MAX=3 SLURM_ARRAY_TASK_MIN=1 All Slurm commands and APIs recognize the SLURM_JOB_ID value. Most commands also recognize the SLURM_ARRAY_JOB_ID plus SLURM_ARRAY_TASK_ID values separated by an underscore as identifying an element of a job array. Using the example above, "37" or "36_2" would be equivalent ways to identify the second array element of job 36. A set of APIs has been developed to operate on an entire job array or on selected tasks of a job array in a single function call. The function response consists of an array identifying the error codes for the various tasks of the job. For example, the job_resume2() function might return an array of error codes indicating that tasks 1 and 2 have already completed; tasks 3 through 5 were resumed successfully; and tasks 6 through 99 have not yet started.
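The task-ID and task-count variables above are handy for splitting a fixed amount of work evenly across the array. Below is a minimal bash sketch; the record total is an illustrative value, and the defaults merely let the snippet run outside Slurm (inside a real job, Slurm sets the SLURM_* variables itself):

```shell
# Split 300 records evenly across the array tasks.
# TOTAL is an illustrative value; Slurm sets the SLURM_* variables in a real job.
TOTAL=300
TASK_ID=${SLURM_ARRAY_TASK_ID:-1}        # defaults allow a local dry run
TASK_COUNT=${SLURM_ARRAY_TASK_COUNT:-3}  # e.g. from: sbatch --array=1-3

CHUNK=$(( TOTAL / TASK_COUNT ))          # records per task (100 here)
START=$(( (TASK_ID - 1) * CHUNK + 1 ))
END=$(( TASK_ID * CHUNK ))
echo "Task $TASK_ID handles records $START through $END"
```

With --array=1-3, task 1 would process records 1-100, task 2 records 101-200, and task 3 records 201-300.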
== File Names == Two additional options are available to specify a job's stdin, stdout, and stderr file names: %A will be replaced by the value of SLURM_ARRAY_JOB_ID (as defined above) and %a will be replaced by the value of SLURM_ARRAY_TASK_ID (as defined above). The default output file format for a job array is "slurm-%A_%a.out". An example of explicit use of the formatting is: sbatch -o slurm-%A_%a.out --array=1-3 -N1 tmp which would generate output file names of this sort: "slurm-36_1.out", "slurm-36_2.out" and "slurm-36_3.out". If these file name options are used without being part of a job array then "%A" will be replaced by the current job ID and "%a" will be replaced by 4,294,967,294 (equivalent to 0xfffffffe or NO_VAL). == Scancel Command Use == If the job ID of a job array is specified as input to the scancel command then all elements of that job array will be cancelled. Alternatively, an array ID, optionally using regular expressions, may be specified for job cancellation. # Cancel array IDs 1 to 3 from job array 20 $ scancel 20_[1-3] # Cancel array IDs 4 and 5 from job array 20 $ scancel 20_4 20_5 # Cancel all elements from job array 20 $ scancel 20 # Cancel the current job or job array element (if job array) if [[ -z $SLURM_ARRAY_JOB_ID ]]; then scancel $SLURM_JOB_ID else scancel ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID} fi == Squeue Command Use == When a job array is submitted to Slurm, only one job record is created. Additional job records will only be created when the state of a task in the job array changes, typically when a task is allocated resources or its state is modified using the scontrol command. By default, the squeue command will report all of the tasks associated with a single job record on one line and use a regular expression to indicate the "array_task_id" values as shown below.
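The "cancel the current job or array element" conditional above can be exercised outside Slurm by stubbing scancel with a shell function and simulating the job IDs. This is purely illustrative: the stub only echoes what a real job would execute, and the ID values mirror the array job 36 / task 2 example above.

```shell
# Stub scancel so the logic can run outside Slurm; a real job would invoke the
# actual scancel command instead of this function.
scancel() { echo "scancel $*"; }

# Simulated IDs mirroring the example: array job 36, task 2, job ID 37.
SLURM_JOB_ID=37
SLURM_ARRAY_JOB_ID=36
SLURM_ARRAY_TASK_ID=2

if [[ -z $SLURM_ARRAY_JOB_ID ]]; then
  RESULT=$(scancel $SLURM_JOB_ID)
else
  RESULT=$(scancel ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID})
fi
echo "$RESULT"   # prints: scancel 36_2
```

Note the spaces inside "[[ -z ... ]]"; bash treats "[[" as a word, so "[[-z" would be a syntax error.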
 $ squeue
         JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
 1080_[5-1024]     debug  tmp  mac PD 0:00     1 (Resources)
        1080_1     debug  tmp  mac  R 0:17     1 tux0
        1080_2     debug  tmp  mac  R 0:16     1 tux1
        1080_3     debug  tmp  mac  R 0:03     1 tux2
        1080_4     debug  tmp  mac  R 0:03     1 tux3

An option of "--array" or "-r" has also been added to the squeue command to print one job array element per line, as shown below. Setting the environment variable "SQUEUE_ARRAY" is equivalent to including the "--array" option on the squeue command line.

 $ squeue -r
  JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
 1082_3     debug  tmp  mac PD 0:00     1 (Resources)
 1082_4     debug  tmp  mac PD 0:00     1 (Priority)
   1080     debug  tmp  mac  R 0:17     1 tux0
   1081     debug  tmp  mac  R 0:16     1 tux1
 1082_1     debug  tmp  mac  R 0:03     1 tux2
 1082_2     debug  tmp  mac  R 0:03     1 tux3

The squeue --step/-s and --job/-j options can accept job or step specifications in the same format.

 $ squeue -j 1234_2,1234_3
 ...
 $ squeue -s 1234_2.0,1234_3.0
 ...

Two additional job output format field options have been added to squeue (all of the obvious letters were already assigned to other job fields):

 %F prints the array_job_id value
 %K prints the array_task_id value

== Scontrol Command Use ==

Use of the scontrol show job option shows two new fields related to job array support. The JobID is a unique identifier for the job. The ArrayJobID is the JobID of the first element of the job array. The ArrayTaskID is the array index of this particular entry, either a single number or an expression identifying the entries represented by this job record (e.g. "5-1024"). Neither field is displayed if the job is not part of a job array. The optional job ID specified with the scontrol show job or scontrol show step commands can identify job array elements by specifying ArrayJobId and ArrayTaskId with an underscore between them (e.g. <ArrayJobID>_<ArrayTaskId>). The scontrol command will operate on all elements of a job array if the job ID specified is the ArrayJobID.
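Since these tools both print and accept the combined <ArrayJobID>_<ArrayTaskID> form, a script post-processing squeue output may need to split an identifier back into its parts. A small sketch using plain bash parameter expansion (the identifier value is taken from the squeue example above):

```shell
# Split a combined "<ArrayJobID>_<ArrayTaskID>" identifier, as printed by
# squeue, into its two components using parameter expansion.
ID="1080_4"                  # example identifier from the squeue output above
ARRAY_JOB_ID="${ID%%_*}"     # text before the first underscore
ARRAY_TASK_ID="${ID##*_}"    # text after the last underscore
echo "array job ${ARRAY_JOB_ID}, task ${ARRAY_TASK_ID}"
```

Either form, `1080_4` or the corresponding plain SLURM_JOB_ID, can then be fed back to commands such as scancel or scontrol.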
Individual job array tasks can be modified using the ArrayJobID_ArrayTaskID format as shown below.

 $ sbatch --array=1-4 -J array ./sleepme 86400
 Submitted batch job 21845
 $ squeue
   JOBID PARTITION   NAME  USER ST  TIME NODES NODELIST
 21845_1    canopo  array david  R  0:13     1 dario
 21845_2    canopo  array david  R  0:13     1 dario
 21845_3    canopo  array david  R  0:13     1 dario
 21845_4    canopo  array david  R  0:13     1 dario
 $ scontrol update JobID=21845_2 name=arturo
 $ squeue
   JOBID PARTITION   NAME  USER ST  TIME NODES NODELIST
 21845_1    canopo  array david  R 17:03     1 dario
 21845_2    canopo arturo david  R 17:03     1 dario
 21845_3    canopo  array david  R 17:03     1 dario
 21845_4    canopo  array david  R 17:03     1 dario

The scontrol hold, holdu, release, requeue, requeuehold, suspend and resume commands can also operate on either all elements of a job array or individual elements, as shown below.

 $ scontrol suspend 21845
 $ squeue
   JOBID PARTITION   NAME  USER ST  TIME NODES NODELIST
 21845_1    canopo  array david  S 25:12     1 dario
 21845_2    canopo arturo david  S 25:12     1 dario
 21845_3    canopo  array david  S 25:12     1 dario
 21845_4    canopo  array david  S 25:12     1 dario
 $ scontrol resume 21845
 $ squeue
   JOBID PARTITION   NAME  USER ST  TIME NODES NODELIST
 21845_1    canopo  array david  R 25:14     1 dario
 21845_2    canopo arturo david  R 25:14     1 dario
 21845_3    canopo  array david  R 25:14     1 dario
 21845_4    canopo  array david  R 25:14     1 dario
 $ scontrol suspend 21845_3
 $ squeue
   JOBID PARTITION   NAME  USER ST  TIME NODES NODELIST
 21845_1    canopo  array david  R 25:14     1 dario
 21845_2    canopo arturo david  R 25:14     1 dario
 21845_3    canopo  array david  S 25:14     1 dario
 21845_4    canopo  array david  R 25:14     1 dario
 $ scontrol resume 21845_3
 $ squeue
   JOBID PARTITION   NAME  USER ST  TIME NODES NODELIST
 21845_1    canopo  array david  R 25:14     1 dario
 21845_2    canopo arturo david  R 25:14     1 dario
 21845_3    canopo  array david  R 25:14     1 dario
 21845_4    canopo  array david  R 25:14     1 dario

== Job Dependencies ==

A job which is to be dependent upon an entire job array should specify itself dependent upon the ArrayJobID.
Since each array element can have a different exit code, the interpretation of the afterok and afternotok clauses will be based upon the highest exit code from any task in the job array. When a job dependency specifies the job ID of a job array:

* The after clause is satisfied after all tasks in the job array start.
* The afterany clause is satisfied after all tasks in the job array complete.
* The aftercorr clause is satisfied after the corresponding task ID in the specified job has completed successfully (ran to completion with an exit code of zero).
* The afterok clause is satisfied after all tasks in the job array complete successfully.
* The afternotok clause is satisfied after all tasks in the job array complete, with at least one task not completing successfully.

Examples of use are shown below:

 # Wait for specific job array elements
 sbatch --depend=after:123_4 my.job
 sbatch --depend=afterok:123_4:123_8 my.job2
 
 # Wait for entire job array to complete
 sbatch --depend=afterany:123 my.job
 
 # Wait for corresponding job array elements
 sbatch --depend=aftercorr:123 my.job
 
 # Wait for entire job array to complete successfully
 sbatch --depend=afterok:123 my.job
 
 # Wait for entire job array to complete with at least one task failing
 sbatch --depend=afternotok:123 my.job

== Other Command Use ==

The following Slurm commands do not currently recognize job arrays; using them requires Slurm job IDs, which are unique for each array element: sbcast, sprio, sreport, sshare and sstat. The sacct, sattach and strigger commands have been modified to permit specification of either job IDs or job array elements. The sview command has been modified to permit display of a job's ArrayJobId and ArrayTaskId fields. Both fields are displayed with a value of "N/A" if the job is not part of a job array.

== System Administration ==

A new configuration parameter has been added to control the maximum job array size: MaxArraySize.
The smallest index that can be specified by a user is zero and the maximum index is MaxArraySize minus one. The default value of MaxArraySize is 1001 and the maximum value supported by Slurm is 4000001. Be mindful about the value of MaxArraySize, as job arrays offer an easy way for users to submit large numbers of jobs very quickly.

The sched/backfill plugin has been modified to improve performance with job arrays. Once one element of a job array is discovered to not be runnable, or to impact the scheduling of pending jobs, the remaining elements of that job array will be quickly skipped.

Slurm creates a single job record when a job array is submitted. Additional job records are only created as needed, typically when a task of a job array is started, which provides a very scalable mechanism for managing large job counts. Each task of the job array will share the same ArrayJobId but have its own unique ArrayTaskId. In addition to the ArrayJobId, each job will have a unique JobId that gets assigned as the tasks are started.

= Quick Reference Guide =

== Job scheduling commands ==

{| class="wikitable"
|-
! Command !! Function !! Basic usage !! Example
|-
| sbatch || submit a Slurm batch job || sbatch [script] || $ sbatch job.sub
|-
| scancel || delete a Slurm batch job || scancel [job_id] || $ scancel 123456
|-
| scontrol hold || hold a Slurm batch job || scontrol hold [job_id] || $ scontrol hold 123456
|-
| scontrol release || release the hold on a Slurm batch job || scontrol release [job_id] || $ scontrol release 123456
|}

== Job management commands ==

Job status commands:

 sinfo -a            list all queues
 squeue              list all jobs
 squeue -u userid    list jobs for userid
 squeue -t R         list running jobs
 smap                show jobs, partitions and nodes in a graphical network topology

== Job script basics ==

A typical job script will look like this:

 #!/bin/bash
 #SBATCH --nodes=1
 #SBATCH --cpus-per-task=8
 #SBATCH --time=02:00:00
 #SBATCH --mem=128G
 #SBATCH --mail-user=netid@gmail.com
 #SBATCH --mail-type=begin
 #SBATCH --mail-type=end
 #SBATCH --error=JobName.%J.err
 #SBATCH --output=JobName.%J.out
 
 cd $SLURM_SUBMIT_DIR
 module load modulename
 your_commands_go_here

Lines starting with #SBATCH are read by the Slurm resource manager to request resources on the HPC system. Some important options are as follows:

{| class="wikitable"
|+ Batch file options
|-
! Option !! Example !! Description
|-
| --nodes || #SBATCH --nodes=1 || Number of nodes
|-
| --cpus-per-task || #SBATCH --cpus-per-task=16 || Number of CPUs per task
|-
| --time || #SBATCH --time=HH:MM:SS || Total time requested for your job
|-
| --output || #SBATCH --output=filename || STDOUT to a file
|-
| --error || #SBATCH --error=filename || STDERR to a file
|-
| --mail-user || #SBATCH --mail-user=user@domain.edu || Email address to send notifications
|}

== Interactive session ==

To start an interactive session, execute the following:

 # this command will give 1 node for a time of 4 hours
 srun -N 1 -t 4:00:00 --pty /bin/bash

== Getting information on past jobs ==

You can use the Slurm database to see how much memory your previous jobs used. For example, the following command will report the requested memory and the used resident and virtual memory for a job:

 sacct -j <JOBID> --format JobID,Partition,Submit,Start,End,NodeList%40,ReqMem,MaxRSS,MaxRSSNode,MaxRSSTask,MaxVMSize,ExitCode

== Aliases that provide useful information parsed from the SLURM commands ==

Place these aliases into your .bashrc:

 alias si="sinfo -o \"%20P %5D %14F %8z %10m %10d %11l %16f %N\""
 alias sq="squeue -o \"%8i %12j %4t %10u %20q %20a %10g %20P %10Q %5D %11l %11L %R\""

== General Commands ==

Get documentation on a command:

 man <command>

Try the following commands:

 man sbatch
 man squeue
 man scancel

== Submitting jobs ==

The following example script specifies a partition, time limit, memory allocation and number of cores. All your scripts should specify values for these four parameters. You can also set additional parameters as shown, such as job name and output file. This script performs a simple task: it generates a file of random numbers and then sorts it. A detailed explanation of the script is available here.
 #!/bin/bash
 #
 #SBATCH -p shared            # partition (queue)
 #SBATCH -c 1                 # number of cores
 #SBATCH --mem 100            # memory pool for all cores (MB)
 #SBATCH -t 0-2:00            # time (D-HH:MM)
 #SBATCH -o slurm.%N.%j.out   # STDOUT
 #SBATCH -e slurm.%N.%j.err   # STDERR
 
 for i in {1..100000}; do
     echo $RANDOM >> SomeRandomNumbers.txt
 done
 sort SomeRandomNumbers.txt

Now you can submit your job with the command:

 sbatch myscript.sh

If you want to test your job and find out when it is estimated to run, use the following (note that this does not actually submit the job):

 sbatch --test-only myscript.sh

== Information on Jobs ==

List all current jobs for a user:
 squeue -u <username>
List all running jobs for a user:
 squeue -u <username> -t RUNNING
List all pending jobs for a user:
 squeue -u <username> -t PENDING
List all current jobs in the shared partition for a user:
 squeue -u <username> -p shared
List detailed information for a job (useful for troubleshooting):
 scontrol show jobid -dd <jobid>
List status info for a currently running job:
 sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps

Once your job has completed, you can get additional information that was not available during the run, such as run time and memory used.
To get statistics on completed jobs by job ID:
 sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed
To view the same information for all jobs of a user:
 sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed

== Controlling jobs ==

To cancel one job:
 scancel <jobid>
To cancel all the jobs for a user:
 scancel -u <username>
To cancel all the pending jobs for a user:
 scancel -t PENDING -u <username>
To cancel one or more jobs by name:
 scancel --name myJobName
To hold a particular job from being scheduled:
 scontrol hold <jobid>
To release a particular job to be scheduled:
 scontrol release <jobid>
To requeue (cancel and rerun) a particular job:
 scontrol requeue <jobid>

== Job arrays and useful commands ==

As shown in the commands above, it's easy to refer to one job by its job ID, or to all your jobs via your username. What if you want to refer to a subset of your jobs? The answer is to submit your job set as a job array. Then you can use the job array ID to refer to the set when running SLURM commands.

== SLURM job arrays ==

To cancel an indexed job in a job array:
 scancel <jobid>_<index>
e.g.
 scancel 1234_4
To find the original submit time for your job array:
 sacct -j 32532756 -o submit -X --noheader | uniq

== Advanced (but useful!) Commands ==

The following commands work for individual jobs and for job arrays, and allow easy manipulation of large numbers of jobs. You can combine these commands with the parameters shown above to provide great flexibility and precision in job control.
(Note that each of these commands is entered on one line.)

Suspend all running jobs for a user (takes into account job arrays):
 squeue -ho %A -t R | xargs -n 1 scontrol suspend
Resume all suspended jobs for a user:
 squeue -o "%.18A %.18t" -u <username> | awk '{if ($2 =="S"){print $1}}' | xargs -n 1 scontrol resume
After resuming, check if any are still suspended:
 squeue -ho %A -u $USER -t S | wc -l
View cluster state:
 shost

= GPU Resources =

When submitting jobs, you can ask for GPUs in one of two ways. One is:

 #SBATCH --gres=gpu:1

That will ask for 1 GPU generically on a node with a free GPU. This request is more specific:

 #SBATCH --gres=gpu:A5500:3

That requests 3 A5500 GPUs '''only'''. We have several GPU types on the cluster which may fit your specific needs:

 nVidia RTX A5500           : 24GB RAM
 nVidia A100                : 80GB RAM
 nVidia GeForce RTX 2080 Ti : 11GB RAM
 nVidia GeForce GTX 1080 Ti : 11GB RAM

= Overview of using Slurm =

When using Slurm, you will need to log into the Slurm head node (currently phoenix-01.gi.ucsc.edu, a one-node cluster at the moment). Once you have ssh'd in there, you can execute Slurm batch or interactive commands. You might also want to consult the [[Quick Reference Guide]].

== Submit a Slurm Batch Job ==

In order to submit a Slurm batch job list, you will need to create a directory that you have read and write access to on all the nodes (which will often be a shared space). Let's say I have a batch named "experiment-1".
I would create that directory in my group's area:

 % mkdir -p /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1
 % cd /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1

Then you will need to create your job submission batch file. It will look something like this. My file is called 'slurm-test.sh':

 % vim slurm-test.sh

Then populate the file as necessary:

 #!/bin/bash
 # Job name:
 #SBATCH --job-name=weiler_test
 #
 # Partition - This is the queue it goes in:
 #SBATCH --partition=main
 #
 # Where to send email (optional)
 #SBATCH --mail-user=weiler@ucsc.edu
 #
 # Number of nodes you need per job:
 #SBATCH --nodes=1
 #
 # Memory needed for the jobs. Try very hard to make this accurate. DEFAULT = 4gb
 #SBATCH --mem=4gb
 #
 # Number of tasks (one for each GPU desired for use case) (example):
 #SBATCH --ntasks=1
 #
 # Processors per task:
 # At least eight times the number of GPUs needed for nVidia RTX A5500
 #SBATCH --cpus-per-task=1
 #
 # Number of GPUs, this can be in the format of "--gres=gpu:[1-8]", or "--gres=gpu:A5500:[1-8]" with the type included (optional)
 #SBATCH --gres=gpu:1
 #
 # Standard output and error log
 #SBATCH --output=serial_test_%j.log
 #
 # Wall clock limit in hrs:min:sec:
 #SBATCH --time=00:00:30
 #
 ## Command(s) to run (example):
 pwd; hostname; date
 echo "Running test script on a single CPU core"
 sleep 5
 echo "Test done!"
 date

Keep the "SBATCH" lines commented; the scheduler will read them anyway. If you don't need a particular option, just don't include it in the file.

To submit the batch job:

 % sbatch slurm-test.sh
 Submitted batch job 7

The job(s) will then be scheduled. You can see the state of the queue as such:

 % squeue
 JOBID PARTITION     NAME   USER ST  TIME NODES NODELIST(REASON)
     7     batch weiler_t weiler  R  0:07     1 phoenix-01

The job will write any STDOUT or STDERR in the directory you launched it from. Other than that, it will do whatever the job does, even if there is no STDOUT.
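To drive follow-up commands (squeue, scancel, dependencies) from a script, the job ID can be captured from sbatch's reply. This sketch simulates the reply with a literal string so the parsing can be shown without a live cluster; on the head node the value would come from running sbatch itself (sbatch also has a `--parsable` option that prints just the ID):

```shell
# Parse the numeric job ID out of sbatch's "Submitted batch job N" reply.
# The reply is simulated here; on the cluster it would come from
#   RESPONSE=$(sbatch slurm-test.sh)
RESPONSE="Submitted batch job 7"
JOB_ID="${RESPONSE##* }"   # keep the last whitespace-separated field
echo "submitted as job ${JOB_ID}"
```

You could then, for example, run `squeue -j "$JOB_ID"` or `scancel "$JOB_ID"`.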
== Launching Several Jobs at Once == You can launch many jobs at once using the $SLURM_ARRAY_TASK_ID variable. Add something like the following to your batch submission file: #SBATCH --array=0-31 #SBATCH --output=array_job_%A_task_%a.out #SBATCH --error=array_job_%A_task_%a.err ## Command(s) to run: echo "I am task $SLURM_ARRAY_TASK_ID" == CGROUPS and Resource Management == Our installation of Slurm will utilize Linux CGROUPS, which puts a hard resource cap on jobs. If you define that your job will need 4GB of RAM, and it uses 5GB, it will fail with an OOM exception. Likewise with CPU or GPU resources; if your job ends up using more than you specify, it will fail. The same goes for the "--time" batch file option: your job will fail if it takes longer than what you specify there. This is to keep the nodes from crashing from runaway jobs that use more resources than you think they will. So... TEST YOUR JOBS! Find out how many resources a single job needs before you launch 100 of them. bc5ecf486ec10d065c3b6eaa768b363198c5e2cf 338 302 2023-06-12T22:11:37Z Weiler 3 wikitext text/x-wiki When using Slurm, you will need to log into the Slurm head node (currently phoenix.prism). Once you have ssh'd in there, you can execute Slurm batch or interactive commands. You might also want to consult the [[Quick Reference Guide]]. == Submit a Slurm Batch Job == In order to submit a Slurm batch job, you will need to create a directory that you will have read and write access to on all the nodes (which will often be a shared space). Let's say I have a batch named "experiment-1". I would create that directory in my group's area: % mkdir -p /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1 % cd /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1 Then you will need to create your job submission batch file. It will look something like this.
My file is called 'slurm-test.sh': % vim slurm-test.sh Then populate the file as necessary: #!/bin/bash # Job name: #SBATCH --job-name=weiler_test # # Partition - This is the queue it goes in: #SBATCH --partition=main # # Where to send email (optional) #SBATCH --mail-user=weiler@ucsc.edu # # Number of nodes you need per job: #SBATCH --nodes=1 # # Memory needed for the jobs. Try very hard to make this accurate. DEFAULT = 4gb #SBATCH --mem=4gb # # Number of tasks (one for each GPU desired for use case) (example): #SBATCH --ntasks=1 # # Processors per task: # At least eight times the number of GPUs needed for nVidia RTX A5500 #SBATCH --cpus-per-task=1 # # Number of GPUs, this can be in the format of "--gres=gpu:[1-8]", or "--gres=gpu:A5500:[1-8]" with the type included (optional) #SBATCH --gres=gpu:1 # # Standard output and error log #SBATCH --output=serial_test_%j.log # # Wall clock limit in hrs:min:sec: #SBATCH --time=00:00:30 # ## Command(s) to run (example): pwd; hostname; date echo "Running test script on a single CPU core" sleep 5 echo "Test done!" date Keep the "SBATCH" lines commented; the scheduler will read them anyway. If you don't need a particular option, just don't include it in the file. To submit the batch job: % sbatch slurm-test.sh Submitted batch job 7 The job(s) will then be scheduled. You can see the state of the queue as follows: % squeue JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 7 batch weiler_t weiler R 0:07 1 phoenix-01 The job will output any STDOUT or STDERR in the directory you launched the job from. Other than that, it will do whatever the job does, even if there is no STDOUT. == Launching Several Jobs at Once == You can launch many jobs at once using the $SLURM_ARRAY_TASK_ID variable.
Add something like the following to your batch submission file: #SBATCH --array=0-31 #SBATCH --output=array_job_%A_task_%a.out #SBATCH --error=array_job_%A_task_%a.err ## Command(s) to run: echo "I am task $SLURM_ARRAY_TASK_ID" == CGROUPS and Resource Management == Our installation of Slurm will utilize Linux CGROUPS, which puts a hard resource cap on jobs. If you define that your job will need 4GB of RAM, and it uses 5GB, it will fail with an OOM exception. Likewise with CPU or GPU resources; if your job ends up using more than you specify, it will fail. The same goes for the "--time" batch file option: your job will fail if it takes longer than what you specify there. This is to keep the nodes from crashing from runaway jobs that use more resources than you think they will. So... TEST YOUR JOBS! Find out how many resources a single job needs before you launch 100 of them. == TEST YOUR JOBS! == Let me say that one more time. Test your jobs before launching a bunch of them! If it fails, you don't want it to fail 100 or more times. You can also get a good idea of how much RAM and CPU it will need so you can better define your batch files. It's critical to get a good idea of how many resources each of your jobs will use and define your job file appropriately. 86b10d1b962ac310edb6dc3f796e39697c65982a Requirement for users to get GI VPN access 0 9 303 263 2023-05-15T00:42:51Z Weiler 3 wikitext text/x-wiki Before you are allowed access to our firewalled/secure area ("Prism" or "CIRM"), you have to complete 3 items and provide the completed certificates or forms: '''1''': You must take and complete the NIH Public Security Refresher Course online. You must complete the course in a single continuous sitting: https://irtsectraining.nih.gov/FYR/00_005.aspx The course is titled "2022 Information Security and Management Refresher". At the end you will be able to print out or save the completion certificate that should have your name on it.
'''2''': You need to print and sign the Genomics Institute VPN User Agreement, located here for download: [[Media:GI_VPN_Policy.pdf]] '''3''': Please print, read and sign the last page of the NIH Genomic Data Sharing Policy agreement, located here for download. By signing the document you agree that you have read and understand the policies described therein and that you agree to abide by those policies: [[Media:NIH_GDS_Policy.pdf]] When you have the three documents described above ready, please complete this form: https://app.smartsheet.com/b/form/a76dbd90ba0240ab9ea9d39b390586ce. There are two parts in this process. 1. For the user, please fill in ALL required fields '''and attach''' all three required documents described above. 2. For the Sponsor/PI - you will receive an email from Smartsheet. Please fill in all required fields and submit. We will receive your completed request and we will create your account and go over the details via a short zoom meeting with you. You will need access to the "eduroam" wireless network '''prior''' to your zoom appointment if you are on campus. Other UCSC wireless networks such as "cruznet" will not work with our VPN software, so please make sure your laptop works with eduroam before coming to your appointment. Instructions on how to get on eduroam are detailed here: https://its.ucsc.edu/wireless/eduroam.html When using the VPN software off-campus, it will usually work unless the wireless network you are on has restrictions preventing it from functioning. Some other universities have such restrictions (notably UCSF), but most other wireless networks and home wireless networks should work fine. Before your appointment, please make sure you install the appropriate OpenVPN software on your laptop (a laptop running OS X, Windows or Ubuntu). For Macs, please download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the Latest Stable version.
For Windows, please download and install '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. Select ''openvpn-install-x.x.x-xxxx.exe'' For Ubuntu, please install network-manager-openvpn by typing: sudo apt-get install network-manager-openvpn network-manager-openvpn-gnome Please do NOT worry about how to configure the software at this point. We will help you to set it up at your appointment. We will correspond with you via email about when the appointment will be. The zoom meeting can take up to 30 minutes per person depending on whether or not any issues come up during the software setup. If you show up for your appointment without one (or more) of the above outlined requirements, we will have to reschedule your appointment for a time when you can arrive after completing the above requirements. '''PLEASE NOTE:''' Because of the overhead required in setting up VPN access, please only request access if you have an immediate need to work on data that exists behind the firewall. We have had a decent number of people request access and go through the setup but then never use it. In other words, please do not request access because "one day you might need it", but because you '''do''' actually need it! '''ALSO NOTE:''' VPN accounts expire after one year from the date of first gaining access. To renew for another year you will need your PI/sponsor to send us a note asking for renewal. 29b1b6beac1cfcf48e6c4198712e0daabf64ed80 304 303 2023-05-15T00:45:52Z Weiler 3 wikitext text/x-wiki Before you are allowed access to our firewalled/secure area ("Prism" or "CIRM"), you have to complete 3 items and provide the completed certificates or forms: '''1''': You must take and complete the NIH Public Security Refresher Course online. You must complete the course in a single continuous sitting: https://irtsectraining.nih.gov/public.aspx Click on the "Enter Public Training Portal" near the bottom of the page.
The course is titled "2022 Information Security, Insider Threats, Privacy Awareness, Records Management and Emergency Preparedness Refresher". At the end you will be able to print out or save the completion certificate that should have your name on it. '''2''': You need to print and sign the Genomics Institute VPN User Agreement, located here for download: [[Media:GI_VPN_Policy.pdf]] '''3''': Please print, read and sign the last page of the NIH Genomic Data Sharing Policy agreement, located here for download. By signing the document you agree that you have read and understand the policies described therein and that you agree to abide by those policies: [[Media:NIH_GDS_Policy.pdf]] When you have the three documents described above ready, please complete this form: https://app.smartsheet.com/b/form/a76dbd90ba0240ab9ea9d39b390586ce. There are two parts in this process. 1. For the user, please fill in ALL required fields '''and attach''' all three required documents described above. 2. For the Sponsor/PI - you will receive an email from Smartsheet. Please fill in all required fields and submit. We will receive your completed request and we will create your account and go over the details via a short zoom meeting with you. You will need access to the "eduroam" wireless network '''prior''' to your zoom appointment if you are on campus. Other UCSC wireless networks such as "cruznet" will not work with our VPN software, so please make sure your laptop works with eduroam before coming to your appointment. Instructions on how to get on eduroam are detailed here: https://its.ucsc.edu/wireless/eduroam.html When using the VPN software off-campus, it will usually work unless the wireless network you are on has restrictions preventing it from functioning. Some other universities have such restrictions (notably UCSF), but most other wireless networks and home wireless networks should work fine.
Before your appointment, please make sure you install the appropriate OpenVPN software on your laptop (a laptop running OS X, Windows or Ubuntu). For Macs, please download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the Latest Stable version. For Windows, please download and install '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. Select ''openvpn-install-x.x.x-xxxx.exe'' For Ubuntu, please install network-manager-openvpn by typing: sudo apt-get install network-manager-openvpn network-manager-openvpn-gnome Please do NOT worry about how to configure the software at this point. We will help you to set it up at your appointment. We will correspond with you via email about when the appointment will be. The zoom meeting can take up to 30 minutes per person depending on whether or not any issues come up during the software setup. If you show up for your appointment without one (or more) of the above outlined requirements, we will have to reschedule your appointment for a time when you can arrive after completing the above requirements. '''PLEASE NOTE:''' Because of the overhead required in setting up VPN access, please only request access if you have an immediate need to work on data that exists behind the firewall. We have had a decent number of people request access and go through the setup but then never use it. In other words, please do not request access because "one day you might need it", but because you '''do''' actually need it! '''ALSO NOTE:''' VPN accounts expire after one year from the date of first gaining access. To renew for another year you will need your PI/sponsor to send us a note asking for renewal.
b21b4bc8b67c6701a48dd95005d89f14383802f0 305 304 2023-05-15T00:57:13Z Weiler 3 wikitext text/x-wiki Before you are allowed access to our firewalled/secure area ("Prism" or "CIRM"), you have to complete 3 items and provide the completed certificates or forms: '''1''': You must take and complete the NIH Public Security Refresher Course online. You must complete the course in a single continuous sitting: https://irtsectraining.nih.gov/public.aspx Click on the "Enter Public Training Portal" near the bottom of the page. The course is titled "2022 Information Security, Insider Threats, Privacy Awareness, Records Management and Emergency Preparedness Refresher". At the end you will be able to print out or save the completion certificate that should have your name on it. '''2''': You need to sign the Genomics Institute VPN User Agreement (digital signature OK), located here for download: [[Media:GI_VPN_Policy.pdf]] '''3''': Please print, read and sign the last page of the NIH Genomic Data Sharing Policy agreement, located here for download. By signing the document you agree that you have read and understand the policies described therein and that you agree to abide by those policies: [[Media:NIH_GDS_Policy.pdf]] When you have the three documents described above ready, please complete this form: https://app.smartsheet.com/b/form/a76dbd90ba0240ab9ea9d39b390586ce. There are two parts in this process. 1. For the user, please fill in ALL required fields '''and attach''' all three required documents described above. 2. For the Sponsor/PI - you will receive an email from Smartsheet. Please fill in all required fields and submit. We will receive your completed request and we will create your account and go over the details via a short zoom meeting with you. You will need access to the "eduroam" wireless network '''prior''' to your zoom appointment if you are on campus.
Other UCSC wireless networks such as "cruznet" will not work with our VPN software, so please make sure your laptop works with eduroam before coming to your appointment. Instructions on how to get on eduroam are detailed here: https://its.ucsc.edu/wireless/eduroam.html When using the VPN software off-campus, it will usually work unless the wireless network you are on has restrictions preventing it from functioning. Some other universities have such restrictions (notably UCSF), but most other wireless networks and home wireless networks should work fine. Before your appointment, please make sure you install the appropriate OpenVPN software on your laptop (a laptop running OS X, Windows or Ubuntu). For Macs, please download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the Latest Stable version. For Windows, please download and install '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. Select ''openvpn-install-x.x.x-xxxx.exe'' For Ubuntu, please install network-manager-openvpn by typing: sudo apt-get install network-manager-openvpn network-manager-openvpn-gnome Please do NOT worry about how to configure the software at this point. We will help you to set it up at your appointment. We will correspond with you via email about when the appointment will be. The zoom meeting can take up to 30 minutes per person depending on whether or not any issues come up during the software setup. If you show up for your appointment without one (or more) of the above outlined requirements, we will have to reschedule your appointment for a time when you can arrive after completing the above requirements. '''PLEASE NOTE:''' Because of the overhead required in setting up VPN access, please only request access if you have an immediate need to work on data that exists behind the firewall. We have had a decent number of people request access and go through the setup but then never use it.
In other words, please do not request access because "one day you might need it", but because you '''do''' actually need it! '''ALSO NOTE:''' VPN accounts expire after one year from the date of first gaining access. To renew for another year you will need your PI/sponsor to send us a note asking for renewal. 15a9c26344392b722345de41bdd2a51e3d6600aa 306 305 2023-05-15T00:57:56Z Weiler 3 wikitext text/x-wiki Before you are allowed access to our firewalled/secure area ("Prism" or "CIRM"), you have to complete 3 items and provide the completed certificates or forms: '''1''': You must take and complete the NIH Public Security Refresher Course online. You must complete the course in a single continuous sitting: https://irtsectraining.nih.gov/public.aspx Click on the "Enter Public Training Portal" near the bottom of the page. The course is titled "2022 Information Security, Insider Threats, Privacy Awareness, Records Management and Emergency Preparedness Refresher". At the end you will be able to save the completion certificate that should have your name on it. '''2''': You need to sign the Genomics Institute VPN User Agreement (digital signature OK), located here for download: [[Media:GI_VPN_Policy.pdf]] '''3''': Please read and sign the last page of the NIH Genomic Data Sharing Policy agreement (digital signature OK), located here for download. By signing the document you agree that you have read and understand the policies described therein and that you agree to abide by those policies: [[Media:NIH_GDS_Policy.pdf]] When you have the three documents described above ready, please complete this form: https://app.smartsheet.com/b/form/a76dbd90ba0240ab9ea9d39b390586ce. There are two parts in this process. 1. For the user, please fill in ALL required fields '''and attach''' all three required documents described above. 2. For the Sponsor/PI - you will receive an email from Smartsheet. Please fill in all required fields and submit.
We will receive your completed request and we will create your account and go over the details via a short zoom meeting with you. You will need access to the "eduroam" wireless network '''prior''' to your zoom appointment if you are on campus. Other UCSC wireless networks such as "cruznet" will not work with our VPN software, so please make sure your laptop works with eduroam before coming to your appointment. Instructions on how to get on eduroam are detailed here: https://its.ucsc.edu/wireless/eduroam.html When using the VPN software off-campus, it will usually work unless the wireless network you are on has restrictions preventing it from functioning. Some other universities have such restrictions (notably UCSF), but most other wireless networks and home wireless networks should work fine. Before your appointment, please make sure you install the appropriate OpenVPN software on your laptop (a laptop running OS X, Windows or Ubuntu). For Macs, please download and install '''Tunnelblick''' from https://tunnelblick.net/downloads.html. Select the Latest Stable version. For Windows, please download and install '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. Select ''openvpn-install-x.x.x-xxxx.exe'' For Ubuntu, please install network-manager-openvpn by typing: sudo apt-get install network-manager-openvpn network-manager-openvpn-gnome Please do NOT worry about how to configure the software at this point. We will help you to set it up at your appointment. We will correspond with you via email about when the appointment will be. The zoom meeting can take up to 30 minutes per person depending on whether or not any issues come up during the software setup. If you show up for your appointment without one (or more) of the above outlined requirements, we will have to reschedule your appointment for a time when you can arrive after completing the above requirements.
'''PLEASE NOTE:''' Because of the overhead required in setting up VPN access, please only request access if you have an immediate need to work on data that exists behind the firewall. We have had a decent number of people request access and go through the setup but then never use it. In other words, please do not request access because "one day you might need it", but because you '''do''' actually need it! '''ALSO NOTE:''' VPN accounts expire after one year from the date of first gaining access. To renew for another year you will need your PI/sponsor to send us a note asking for renewal. 35dc5b224e4a895a5f9d290cada460cc16566d09 GPU Resources 0 36 307 299 2023-05-15T18:44:00Z Anovak 4 Explain how to actually use GPUs and what won't work wikitext text/x-wiki When submitting jobs, you can ask for GPUs in one of two ways. One is: #SBATCH --gres=gpu:1 That will ask for 1 GPU generically on a node with a free GPU. This request is more specific: #SBATCH --gres=gpu:A5500:3 That requests 3 A5500 GPUs '''only'''. We have several GPU types on the cluster which may fit your specific needs: nVidia RTX A5500 : 24GB RAM nVidia A100 : 80GB RAM nVidia GeForce RTX 2080 Ti : 11GB RAM nVidia GeForce GTX 1080 Ti : 11GB RAM ==Using GPUs== The Slurm cluster nodes have the nVidia drivers installed, as well as basic CUDA tools like nvidia-smi, but they do not have the full CUDA Toolkit, and they do not have the nvcc CUDA-enabled compiler. If you want to compile applications that use CUDA, you will need to install the development environment yourself for your user. To actually use a GPU, you need to run a program that uses the CUDA API. Some projects, such as tensorflow, may ship pre-built binaries that can use CUDA without needing a compiler. You can also run containers on the cluster using Singularity, and give them access to GPUs using the --nv option.
For example: singularity pull docker://tensorflow/tensorflow:latest-gpu srun -c 8 --mem 10G --gres=gpu:1 --exclude phoenix-01 /usr/bin/singularity run --nv docker://tensorflow/tensorflow:latest-gpu python -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())' This will produce output showing that the Tensorflow container is indeed able to talk to one GPU: INFO: Using cached SIF image 2023-05-15 11:36:33.110850: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. 2023-05-15 11:36:38.799035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /device:GPU:0 with 22244 MB memory: -> device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6 [name: "/device:CPU:0" device_type: "CPU" memory_limit: 268435456 locality { } incarnation: 8527638019084870106 xla_global_id: -1 , name: "/device:GPU:0" device_type: "GPU" memory_limit: 23324655616 locality { bus_id: 1 links { } } incarnation: 1860154623440434360 physical_device_desc: "device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6" xla_global_id: 416903419 ] You might be used to running containers with Docker, or containerized GPU workloads with the nVidia Container Runtime or Toolkit. Unfortunately, the Docker daemon is not installed on the cluster nodes, and so Docker is not available. Instead, Singularity can run most Docker containers, without requiring users to be able to manipulate a highly-privileged daemon. Slurm itself also supports a --container option for jobs, which allows a whole job to be run inside a container. If you are able to convert your container to OCI Bundle format, you can pass it directly to Slurm instead of using Singularity from inside the job. 
However, Docker-compatible image specifiers can't be given to Slurm, only paths to OCI bundles on disk, and the tools to download a Docker image from Docker Hub in OCI bundle format (skopeo and umoci) are not installed on the cluster. 401d8811c76dbb0b9cfc897f8c36d7a8970b433c 308 307 2023-05-15T18:52:52Z Anovak 4 Divide up by method wikitext text/x-wiki When submitting jobs, you can ask for GPUs in one of two ways. One is: #SBATCH --gres=gpu:1 That will ask for 1 GPU generically on a node with a free GPU. This request is more specific: #SBATCH --gres=gpu:A5500:3 That requests 3 A5500 GPUs '''only'''. We have several GPU types on the cluster which may fit your specific needs: nVidia RTX A5500 : 24GB RAM nVidia A100 : 80GB RAM nVidia GeForce RTX 2080 Ti : 11GB RAM nVidia GeForce GTX 1080 Ti : 11GB RAM ==Running GPU Workloads== To actually use an nVidia GPU, you need to run a program that uses the CUDA API. There are a few ways to obtain such a program. ===Prebuilt CUDA Applications=== The Slurm cluster nodes have the nVidia drivers installed, as well as basic CUDA tools like nvidia-smi. Some projects, such as tensorflow, may ship pre-built binaries that can use CUDA. You should be able to run these binaries directly, if you download them. ===Building CUDA Applications=== The cluster nodes do not have the full CUDA Toolkit. In particular, they do not have the '''nvcc''' CUDA-enabled compiler. If you want to compile applications that use CUDA, you will need to install the development environment yourself for your user. Once you have '''nvcc''' available to your user, building CUDA applications should work. To run them, you will have to submit them as jobs, because the head node does not have a GPU. ===Containerized GPU Workloads=== You can also run containers on the cluster using Singularity, and give them access to GPUs using the --nv option.
For example: singularity pull docker://tensorflow/tensorflow:latest-gpu srun -c 8 --mem 10G --gres=gpu:1 --exclude phoenix-01 /usr/bin/singularity run --nv docker://tensorflow/tensorflow:latest-gpu python -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())' This will produce output showing that the Tensorflow container is indeed able to talk to one GPU: INFO: Using cached SIF image 2023-05-15 11:36:33.110850: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. 2023-05-15 11:36:38.799035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /device:GPU:0 with 22244 MB memory: -> device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6 [name: "/device:CPU:0" device_type: "CPU" memory_limit: 268435456 locality { } incarnation: 8527638019084870106 xla_global_id: -1 , name: "/device:GPU:0" device_type: "GPU" memory_limit: 23324655616 locality { bus_id: 1 links { } } incarnation: 1860154623440434360 physical_device_desc: "device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6" xla_global_id: 416903419 ] ====Running Containers in Slurm==== Slurm itself also supports a '''--container''' option for jobs, which allows a whole job to be run inside a container. If you are able to [https://slurm.schedmd.com/containers.html convert your container to OCI Bundle format], you can pass it directly to Slurm instead of using Singularity from inside the job. However, Docker-compatible image specifiers can't be given to Slurm, only paths to OCI bundles on disk, and the tools to download a Docker image from Docker Hub in OCI bundle format ('''skopeo''' and '''umoci''') are not installed on the cluster. 
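Assuming you already have an OCI bundle unpacked on shared storage, a submission might look roughly like the sketch below. The bundle path and job script names are hypothetical placeholders, and --container behavior depends on how the cluster's Slurm is configured, so treat this as a shape rather than a recipe:

```shell
#!/bin/bash
# Hypothetical sketch: run a job inside an OCI bundle via Slurm's native
# container support. The bundle path and job script are made-up examples.
# Guarded so the snippet is a harmless no-op on machines without Slurm.
if command -v sbatch >/dev/null 2>&1; then
    sbatch --container=/private/groups/mylab/bundles/my-gpu-app \
           --gres=gpu:1 my-gpu-job.sh
else
    echo "sbatch not found; run this on the Slurm head node"
fi
```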
Slurm containers ''should'' have access to their assigned GPUs, but it is not clear if tools like '''nvidia-smi''' are injected into the container, as they would be with Singularity or the nVidia Container Runtime. ====Running Containers in Docker==== You might be used to running containers with Docker, or containerized GPU workloads with the nVidia Container Runtime or Toolkit. Unfortunately, the Docker daemon is not installed on the cluster nodes, and so Docker is not available. Instead, Singularity can run most Docker containers, without requiring users to be able to manipulate a highly-privileged daemon. 5e140883fff43dc43722bda4e06d390ba436bfc7 309 308 2023-05-15T18:54:53Z Anovak 4 /* Containerized GPU Workloads */ wikitext text/x-wiki When submitting jobs, you can ask for GPUs in one of two ways. One is: #SBATCH --gres=gpu:1 That will ask for 1 GPU generically on a node with a free GPU. This request is more specific: #SBATCH --gres=gpu:A5500:3 That requests 3 A5500 GPUs '''only'''. We have several GPU types on the cluster which may fit your specific needs: nVidia RTX A5500 : 24GB RAM nVidia A100 : 80GB RAM nVidia GeForce RTX 2080 Ti : 11GB RAM nVidia GeForce GTX 1080 Ti : 11GB RAM ==Running GPU Workloads== To actually use an nVidia GPU, you need to run a program that uses the CUDA API. There are a few ways to obtain such a program. ===Prebuilt CUDA Applications=== The Slurm cluster nodes have the nVidia drivers installed, as well as basic CUDA tools like nvidia-smi. Some projects, such as tensorflow, may ship pre-built binaries that can use CUDA. You should be able to run these binaries directly, if you download them. ===Building CUDA Applications=== The cluster nodes do not have the full CUDA Toolkit. In particular, they do not have the '''nvcc''' CUDA-enabled compiler. If you want to compile applications that use CUDA, you will need to install the development environment yourself for your user.
Once you have '''nvcc''' available to your user, building CUDA applications should work. To run them, you will have to submit them as jobs, because the head node does not have a GPU. ===Containerized GPU Workloads=== Instead of directly installing binaries, or installing and using the CUDA Toolkit, it is often easiest to use containers to download a prebuilt GPU workload and all of its libraries and dependencies. There are a few options for running containerized GPU workloads on the cluster. ====Running Containers in Singularity==== You can run containers on the cluster using Singularity, and give them access to GPUs using the '''--nv''' option. For example: singularity pull docker://tensorflow/tensorflow:latest-gpu srun -c 8 --mem 10G --gres=gpu:1 --exclude phoenix-01 /usr/bin/singularity run --nv docker://tensorflow/tensorflow:latest-gpu python -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())' This will produce output showing that the Tensorflow container is indeed able to talk to one GPU: INFO: Using cached SIF image 2023-05-15 11:36:33.110850: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-05-15 11:36:38.799035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /device:GPU:0 with 22244 MB memory: -> device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6 [name: "/device:CPU:0" device_type: "CPU" memory_limit: 268435456 locality { } incarnation: 8527638019084870106 xla_global_id: -1 , name: "/device:GPU:0" device_type: "GPU" memory_limit: 23324655616 locality { bus_id: 1 links { } } incarnation: 1860154623440434360 physical_device_desc: "device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6" xla_global_id: 416903419 ] ====Running Containers in Slurm==== Slurm itself also supports a '''--container''' option for jobs, which allows a whole job to be run inside a container. If you are able to [https://slurm.schedmd.com/containers.html convert your container to OCI Bundle format], you can pass it directly to Slurm instead of using Singularity from inside the job. However, Docker-compatible image specifiers can't be given to Slurm, only paths to OCI bundles on disk, and the tools to download a Docker image from Docker Hub in OCI bundle format ('''skopeo''' and '''umoci''') are not installed on the cluster. Slurm containers ''should'' have access to their assigned GPUs, but it is not clear if tools like '''nvidia-smi''' are injected into the container, as they would be with Singularity or the nVidia Container Runtime. ====Running Containers in Docker==== You might be used to running containers with Docker, or containerized GPU workloads with the nVidia Container Runtime or Toolkit. Unfortunately, the Docker daemon is not installed on the cluster nodes, and so Docker is not available. Instead, Singularity can run most Docker containers, without requiring users to be able to manipulate a highly-privileged daemon. fdd818aec649e853187067ab810d310deb0cb371 310 309 2023-05-15T19:07:16Z Anovak 4 wikitext text/x-wiki When submitting jobs, you can ask for GPUs in one of two ways. 
One is: #SBATCH --gres=gpu:1 That will ask for 1 GPU generically on a node with a free GPU. This request is more specific: #SBATCH --gres=gpu:A5500:3 That requests 3 A5500 GPUs '''only'''. We have several GPU types on the cluster which may fit your specific needs: nVidia RTX A5500 : 24GB RAM nVidia A100 : 80GB RAM nVidia GeForce RTX 2080 Ti : 11GB RAM nVidia GeForce RTX 1080 Ti : 11GB RAM ==Running GPU Workloads== To actually use an nVidia GPU, you need to run a program that uses the CUDA API. There are a few ways to obtain such a program. ===Prebuilt CUDA Applications=== The Slurm cluster nodes have the nVidia drivers installed, as well as basic CUDA tools like nvidia-smi. Some projects, such as tensorflow, may ship pre-built binaries that can use CUDA. You should be able to run these binaries directly, if you download them. ===Building CUDA Applications=== The cluster nodes do not have the full CUDA Toolkit. In particular, they do not have the '''nvcc''' CUDA-enabled compiler. If you want to compile applications that use CUDA, you will need to install the development environment yourself for your user. Once you have '''nvcc''' available to your user, building CUDA applications should work. To run them, you will have to submit them as jobs, because the head node does not have a GPU. ===Containerized GPU Workloads=== Instead of directly installing binaries, or installing and using the CUDA Toolkit, it is often easiest to use containers to download a prebuilt GPU workload and all of its libraries and dependencies. THere are a few options for running containerized GPU workloads on the cluster. ====Running Containers in Singularity==== You can run containers on the cluster using Singularity, and give them access to GPUs using the '''--nv''' option. 
For example: singularity pull docker://tensorflow/tensorflow:latest-gpu srun -c 8 --mem 10G --gres=gpu:1 --exclude phoenix-01 singularity run --nv docker://tensorflow/tensorflow:latest-gpu python -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())' This will produce output showing that the Tensorflow container is indeed able to talk to one GPU: INFO: Using cached SIF image 2023-05-15 11:36:33.110850: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. 2023-05-15 11:36:38.799035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /device:GPU:0 with 22244 MB memory: -> device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6 [name: "/device:CPU:0" device_type: "CPU" memory_limit: 268435456 locality { } incarnation: 8527638019084870106 xla_global_id: -1 , name: "/device:GPU:0" device_type: "GPU" memory_limit: 23324655616 locality { bus_id: 1 links { } } incarnation: 1860154623440434360 physical_device_desc: "device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6" xla_global_id: 416903419 ] ====Running Containers in Slurm==== Slurm itself also supports a '''--container''' option for jobs, which allows a whole job to be run inside a container. If you are able to [https://slurm.schedmd.com/containers.html convert your container to OCI Bundle format], you can pass it directly to Slurm instead of using Singularity from inside the job. However, Docker-compatible image specifiers can't be given to Slurm, only paths to OCI bundles on disk, and the tools to download a Docker image from Docker Hub in OCI bundle format ('''skopeo''' and '''umoci''') are not installed on the cluster. 
Slurm containers ''should'' have access to their assigned GPUs, but it is not clear if tools like '''nvidia-smi''' are injected into the container, as they would be with Singularity or the nVidia Container Runtime. ====Running Containers in Docker==== You might be used to running containers with Docker, or containerized GPU workloads with the nVidia Container Runtime or Toolkit. Unfortunately, the Docker daemon is not installed on the cluster nodes, and so Docker is not available. Instead, Singularity can run most Docker containers, without requiring users to be able to manipulate a highly-privileged daemon. 5e2acad41af012d543c1fe2611f7d1b4c5da3615 311 310 2023-05-15T19:37:30Z Anovak 4 /* Running Containers in Docker */ wikitext text/x-wiki When submitting jobs, you can ask for GPUs in one of two ways. One is: #SBATCH --gres=gpu:1 That will ask for 1 GPU generically on a node with a free GPU. This request is more specific: #SBATCH --gres=gpu:A5500:3 That requests 3 A5500 GPUs '''only'''. We have several GPU types on the cluster which may fit your specific needs: nVidia RTX A5500 : 24GB RAM nVidia A100 : 80GB RAM nVidia GeForce RTX 2080 Ti : 11GB RAM nVidia GeForce RTX 1080 Ti : 11GB RAM ==Running GPU Workloads== To actually use an nVidia GPU, you need to run a program that uses the CUDA API. There are a few ways to obtain such a program. ===Prebuilt CUDA Applications=== The Slurm cluster nodes have the nVidia drivers installed, as well as basic CUDA tools like nvidia-smi. Some projects, such as tensorflow, may ship pre-built binaries that can use CUDA. You should be able to run these binaries directly, if you download them. ===Building CUDA Applications=== The cluster nodes do not have the full CUDA Toolkit. In particular, they do not have the '''nvcc''' CUDA-enabled compiler. If you want to compile applications that use CUDA, you will need to install the development environment yourself for your user. 
Once you have '''nvcc''' available to your user, building CUDA applications should work. To run them, you will have to submit them as jobs, because the head node does not have a GPU. ===Containerized GPU Workloads=== Instead of directly installing binaries, or installing and using the CUDA Toolkit, it is often easiest to use containers to download a prebuilt GPU workload and all of its libraries and dependencies. THere are a few options for running containerized GPU workloads on the cluster. ====Running Containers in Singularity==== You can run containers on the cluster using Singularity, and give them access to GPUs using the '''--nv''' option. For example: singularity pull docker://tensorflow/tensorflow:latest-gpu srun -c 8 --mem 10G --gres=gpu:1 --exclude phoenix-01 singularity run --nv docker://tensorflow/tensorflow:latest-gpu python -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())' This will produce output showing that the Tensorflow container is indeed able to talk to one GPU: INFO: Using cached SIF image 2023-05-15 11:36:33.110850: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. 
2023-05-15 11:36:38.799035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /device:GPU:0 with 22244 MB memory: -> device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6 [name: "/device:CPU:0" device_type: "CPU" memory_limit: 268435456 locality { } incarnation: 8527638019084870106 xla_global_id: -1 , name: "/device:GPU:0" device_type: "GPU" memory_limit: 23324655616 locality { bus_id: 1 links { } } incarnation: 1860154623440434360 physical_device_desc: "device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6" xla_global_id: 416903419 ] ====Running Containers in Slurm==== Slurm itself also supports a '''--container''' option for jobs, which allows a whole job to be run inside a container. If you are able to [https://slurm.schedmd.com/containers.html convert your container to OCI Bundle format], you can pass it directly to Slurm instead of using Singularity from inside the job. However, Docker-compatible image specifiers can't be given to Slurm, only paths to OCI bundles on disk, and the tools to download a Docker image from Docker Hub in OCI bundle format ('''skopeo''' and '''umoci''') are not installed on the cluster. Slurm containers ''should'' have access to their assigned GPUs, but it is not clear if tools like '''nvidia-smi''' are injected into the container, as they would be with Singularity or the nVidia Container Runtime. ====Running Containers in Docker==== You might be used to running containers with Docker, or containerized GPU workloads with the nVidia Container Runtime or Toolkit. Docker is installed on all the nodes and the daemon is running; if the `docker` command does not work for you, ask cluster-admin to add you to the right groups. We are working on getting the nVidia runtime set up to make the '''--gpus''' option to '''docker run''' work. 
f19a18b08884f72c5ae0c935fd1faaad93e90eee 312 311 2023-05-15T19:45:19Z Anovak 4 /* Running Containers in Slurm */ wikitext text/x-wiki When submitting jobs, you can ask for GPUs in one of two ways. One is: #SBATCH --gres=gpu:1 That will ask for 1 GPU generically on a node with a free GPU. This request is more specific: #SBATCH --gres=gpu:A5500:3 That requests 3 A5500 GPUs '''only'''. We have several GPU types on the cluster which may fit your specific needs: nVidia RTX A5500 : 24GB RAM nVidia A100 : 80GB RAM nVidia GeForce RTX 2080 Ti : 11GB RAM nVidia GeForce RTX 1080 Ti : 11GB RAM ==Running GPU Workloads== To actually use an nVidia GPU, you need to run a program that uses the CUDA API. There are a few ways to obtain such a program. ===Prebuilt CUDA Applications=== The Slurm cluster nodes have the nVidia drivers installed, as well as basic CUDA tools like nvidia-smi. Some projects, such as tensorflow, may ship pre-built binaries that can use CUDA. You should be able to run these binaries directly, if you download them. ===Building CUDA Applications=== The cluster nodes do not have the full CUDA Toolkit. In particular, they do not have the '''nvcc''' CUDA-enabled compiler. If you want to compile applications that use CUDA, you will need to install the development environment yourself for your user. Once you have '''nvcc''' available to your user, building CUDA applications should work. To run them, you will have to submit them as jobs, because the head node does not have a GPU. ===Containerized GPU Workloads=== Instead of directly installing binaries, or installing and using the CUDA Toolkit, it is often easiest to use containers to download a prebuilt GPU workload and all of its libraries and dependencies. THere are a few options for running containerized GPU workloads on the cluster. ====Running Containers in Singularity==== You can run containers on the cluster using Singularity, and give them access to GPUs using the '''--nv''' option. 
For example: singularity pull docker://tensorflow/tensorflow:latest-gpu srun -c 8 --mem 10G --gres=gpu:1 --exclude phoenix-01 singularity run --nv docker://tensorflow/tensorflow:latest-gpu python -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())' This will produce output showing that the Tensorflow container is indeed able to talk to one GPU: INFO: Using cached SIF image 2023-05-15 11:36:33.110850: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. 2023-05-15 11:36:38.799035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /device:GPU:0 with 22244 MB memory: -> device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6 [name: "/device:CPU:0" device_type: "CPU" memory_limit: 268435456 locality { } incarnation: 8527638019084870106 xla_global_id: -1 , name: "/device:GPU:0" device_type: "GPU" memory_limit: 23324655616 locality { bus_id: 1 links { } } incarnation: 1860154623440434360 physical_device_desc: "device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6" xla_global_id: 416903419 ] ====Running Containers in Slurm==== Slurm itself also supports a '''--container''' option for jobs, which allows a whole job to be run inside a container. If you are able to [https://slurm.schedmd.com/containers.html convert your container to OCI Bundle format], you can pass it directly to Slurm instead of using Singularity from inside the job. However, Docker-compatible image specifiers can't be given to Slurm, only paths to OCI bundles on disk. Stnad-alone tools to download a Docker image from Docker Hub in OCI bundle format ('''skopeo''' and '''umoci''') are not yet installed on the cluster. 
But the method using the '''docker''' command should work. Slurm containers ''should'' have access to their assigned GPUs, but it is not clear if tools like '''nvidia-smi''' are injected into the container, as they would be with Singularity or the nVidia Container Runtime. ====Running Containers in Docker==== You might be used to running containers with Docker, or containerized GPU workloads with the nVidia Container Runtime or Toolkit. Docker is installed on all the nodes and the daemon is running; if the `docker` command does not work for you, ask cluster-admin to add you to the right groups. We are working on getting the nVidia runtime set up to make the '''--gpus''' option to '''docker run''' work. 5e3f910977d3b6e4bb5cf1b3bb7ac442f02e833b 313 312 2023-05-15T19:55:47Z Anovak 4 /* Running Containers in Docker */ wikitext text/x-wiki When submitting jobs, you can ask for GPUs in one of two ways. One is: #SBATCH --gres=gpu:1 That will ask for 1 GPU generically on a node with a free GPU. This request is more specific: #SBATCH --gres=gpu:A5500:3 That requests 3 A5500 GPUs '''only'''. We have several GPU types on the cluster which may fit your specific needs: nVidia RTX A5500 : 24GB RAM nVidia A100 : 80GB RAM nVidia GeForce RTX 2080 Ti : 11GB RAM nVidia GeForce RTX 1080 Ti : 11GB RAM ==Running GPU Workloads== To actually use an nVidia GPU, you need to run a program that uses the CUDA API. There are a few ways to obtain such a program. ===Prebuilt CUDA Applications=== The Slurm cluster nodes have the nVidia drivers installed, as well as basic CUDA tools like nvidia-smi. Some projects, such as tensorflow, may ship pre-built binaries that can use CUDA. You should be able to run these binaries directly, if you download them. ===Building CUDA Applications=== The cluster nodes do not have the full CUDA Toolkit. In particular, they do not have the '''nvcc''' CUDA-enabled compiler. 
If you want to compile applications that use CUDA, you will need to install the development environment yourself for your user. Once you have '''nvcc''' available to your user, building CUDA applications should work. To run them, you will have to submit them as jobs, because the head node does not have a GPU. ===Containerized GPU Workloads=== Instead of directly installing binaries, or installing and using the CUDA Toolkit, it is often easiest to use containers to download a prebuilt GPU workload and all of its libraries and dependencies. THere are a few options for running containerized GPU workloads on the cluster. ====Running Containers in Singularity==== You can run containers on the cluster using Singularity, and give them access to GPUs using the '''--nv''' option. For example: singularity pull docker://tensorflow/tensorflow:latest-gpu srun -c 8 --mem 10G --gres=gpu:1 --exclude phoenix-01 singularity run --nv docker://tensorflow/tensorflow:latest-gpu python -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())' This will produce output showing that the Tensorflow container is indeed able to talk to one GPU: INFO: Using cached SIF image 2023-05-15 11:36:33.110850: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. 
2023-05-15 11:36:38.799035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /device:GPU:0 with 22244 MB memory: -> device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6 [name: "/device:CPU:0" device_type: "CPU" memory_limit: 268435456 locality { } incarnation: 8527638019084870106 xla_global_id: -1 , name: "/device:GPU:0" device_type: "GPU" memory_limit: 23324655616 locality { bus_id: 1 links { } } incarnation: 1860154623440434360 physical_device_desc: "device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6" xla_global_id: 416903419 ] ====Running Containers in Slurm==== Slurm itself also supports a '''--container''' option for jobs, which allows a whole job to be run inside a container. If you are able to [https://slurm.schedmd.com/containers.html convert your container to OCI Bundle format], you can pass it directly to Slurm instead of using Singularity from inside the job. However, Docker-compatible image specifiers can't be given to Slurm, only paths to OCI bundles on disk. Stnad-alone tools to download a Docker image from Docker Hub in OCI bundle format ('''skopeo''' and '''umoci''') are not yet installed on the cluster. But the method using the '''docker''' command should work. Slurm containers ''should'' have access to their assigned GPUs, but it is not clear if tools like '''nvidia-smi''' are injected into the container, as they would be with Singularity or the nVidia Container Runtime. ====Running Containers in Docker==== You might be used to running containers with Docker, or containerized GPU workloads with the nVidia Container Runtime or Toolkit. Docker is installed on all the nodes and the daemon is running; if the '''docker''' command does not work for you, ask cluster-admin to add you to the right groups. We are working on getting the nVidia runtime set up to make the '''--gpus''' option to '''docker run''' work. 
390b8b7d950122a6dde08c541d077409df833ca1 321 313 2023-05-18T21:39:08Z Anovak 4 wikitext text/x-wiki When submitting jobs, you can ask for GPUs in one of two ways. One is: #SBATCH --gres=gpu:1 That will ask for 1 GPU generically on a node with a free GPU. This request is more specific: #SBATCH --gres=gpu:A5500:3 That requests 3 A5500 GPUs '''only'''. We have several GPU types on the cluster which may fit your specific needs: nVidia RTX A5500 : 24GB RAM nVidia A100 : 80GB RAM nVidia GeForce RTX 2080 Ti : 11GB RAM nVidia GeForce RTX 1080 Ti : 11GB RAM ==Running GPU Workloads== To actually use an nVidia GPU, you need to run a program that uses the CUDA API. There are a few ways to obtain such a program. ===Prebuilt CUDA Applications=== The Slurm cluster nodes have the nVidia drivers installed, as well as basic CUDA tools like nvidia-smi. Some projects, such as tensorflow, may ship pre-built binaries that can use CUDA. You should be able to run these binaries directly, if you download them. ===Building CUDA Applications=== The cluster nodes do not have the full CUDA Toolkit. In particular, they do not have the '''nvcc''' CUDA-enabled compiler. If you want to compile applications that use CUDA, you will need to install the development environment yourself for your user. Once you have '''nvcc''' available to your user, building CUDA applications should work. To run them, you will have to submit them as jobs, because the head node does not have a GPU. ===Containerized GPU Workloads=== Instead of directly installing binaries, or installing and using the CUDA Toolkit, it is often easiest to use containers to download a prebuilt GPU workload and all of its libraries and dependencies. THere are a few options for running containerized GPU workloads on the cluster. ====Running Containers in Singularity==== You can run containers on the cluster using Singularity, and give them access to GPUs using the '''--nv''' option. 
For example: singularity pull docker://tensorflow/tensorflow:latest-gpu srun -c 8 --mem 10G --gres=gpu:1 --exclude phoenix-01 singularity run --nv docker://tensorflow/tensorflow:latest-gpu python -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())' This will produce output showing that the Tensorflow container is indeed able to talk to one GPU: INFO: Using cached SIF image 2023-05-15 11:36:33.110850: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. 2023-05-15 11:36:38.799035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /device:GPU:0 with 22244 MB memory: -> device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6 [name: "/device:CPU:0" device_type: "CPU" memory_limit: 268435456 locality { } incarnation: 8527638019084870106 xla_global_id: -1 , name: "/device:GPU:0" device_type: "GPU" memory_limit: 23324655616 locality { bus_id: 1 links { } } incarnation: 1860154623440434360 physical_device_desc: "device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6" xla_global_id: 416903419 ] ====Running Containers in Slurm==== Slurm itself also supports a '''--container''' option for jobs, which allows a whole job to be run inside a container. If you are able to [https://slurm.schedmd.com/containers.html convert your container to OCI Bundle format], you can pass it directly to Slurm instead of using Singularity from inside the job. However, Docker-compatible image specifiers can't be given to Slurm, only paths to OCI bundles on disk. Stnad-alone tools to download a Docker image from Docker Hub in OCI bundle format ('''skopeo''' and '''umoci''') are not yet installed on the cluster. 
But the method using the '''docker''' command should work. Slurm containers ''should'' have access to their assigned GPUs, but it is not clear if tools like '''nvidia-smi''' are injected into the container, as they would be with Singularity or the nVidia Container Runtime. ====Running Containers in Docker==== You might be used to running containers with Docker, or containerized GPU workloads with the nVidia Container Runtime or Toolkit. Docker is installed on all the nodes and the daemon is running; if the '''docker''' command does not work for you, ask cluster-admin to add you to the right groups. The '''nvidia''' runtime is set up, so you should be able to do: srun -c 1 --mem 4G --gres=gpu:1 docker run --rm --runtime=nvidia --gpus=1 nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi You shouldn't need the '''--runtime''' argument in normal operation, just '''--gpus'''. c08d91a196e44e6d05858c8205b2933b2aef2609 322 321 2023-05-18T21:43:01Z Anovak 4 /* Running Containers in Docker */ wikitext text/x-wiki When submitting jobs, you can ask for GPUs in one of two ways. One is: #SBATCH --gres=gpu:1 That will ask for 1 GPU generically on a node with a free GPU. This request is more specific: #SBATCH --gres=gpu:A5500:3 That requests 3 A5500 GPUs '''only'''. We have several GPU types on the cluster which may fit your specific needs: nVidia RTX A5500 : 24GB RAM nVidia A100 : 80GB RAM nVidia GeForce RTX 2080 Ti : 11GB RAM nVidia GeForce RTX 1080 Ti : 11GB RAM ==Running GPU Workloads== To actually use an nVidia GPU, you need to run a program that uses the CUDA API. There are a few ways to obtain such a program. ===Prebuilt CUDA Applications=== The Slurm cluster nodes have the nVidia drivers installed, as well as basic CUDA tools like nvidia-smi. Some projects, such as tensorflow, may ship pre-built binaries that can use CUDA. You should be able to run these binaries directly, if you download them. 
===Building CUDA Applications=== The cluster nodes do not have the full CUDA Toolkit. In particular, they do not have the '''nvcc''' CUDA-enabled compiler. If you want to compile applications that use CUDA, you will need to install the development environment yourself for your user. Once you have '''nvcc''' available to your user, building CUDA applications should work. To run them, you will have to submit them as jobs, because the head node does not have a GPU. ===Containerized GPU Workloads=== Instead of directly installing binaries, or installing and using the CUDA Toolkit, it is often easiest to use containers to download a prebuilt GPU workload and all of its libraries and dependencies. THere are a few options for running containerized GPU workloads on the cluster. ====Running Containers in Singularity==== You can run containers on the cluster using Singularity, and give them access to GPUs using the '''--nv''' option. For example: singularity pull docker://tensorflow/tensorflow:latest-gpu srun -c 8 --mem 10G --gres=gpu:1 --exclude phoenix-01 singularity run --nv docker://tensorflow/tensorflow:latest-gpu python -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())' This will produce output showing that the Tensorflow container is indeed able to talk to one GPU: INFO: Using cached SIF image 2023-05-15 11:36:33.110850: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. 
2023-05-15 11:36:38.799035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /device:GPU:0 with 22244 MB memory: -> device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6 [name: "/device:CPU:0" device_type: "CPU" memory_limit: 268435456 locality { } incarnation: 8527638019084870106 xla_global_id: -1 , name: "/device:GPU:0" device_type: "GPU" memory_limit: 23324655616 locality { bus_id: 1 links { } } incarnation: 1860154623440434360 physical_device_desc: "device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6" xla_global_id: 416903419 ] ====Running Containers in Slurm==== Slurm itself also supports a '''--container''' option for jobs, which allows a whole job to be run inside a container. If you are able to [https://slurm.schedmd.com/containers.html convert your container to OCI Bundle format], you can pass it directly to Slurm instead of using Singularity from inside the job. However, Docker-compatible image specifiers can't be given to Slurm, only paths to OCI bundles on disk. Stnad-alone tools to download a Docker image from Docker Hub in OCI bundle format ('''skopeo''' and '''umoci''') are not yet installed on the cluster. But the method using the '''docker''' command should work. Slurm containers ''should'' have access to their assigned GPUs, but it is not clear if tools like '''nvidia-smi''' are injected into the container, as they would be with Singularity or the nVidia Container Runtime. ====Running Containers in Docker==== You might be used to running containers with Docker, or containerized GPU workloads with the nVidia Container Runtime or Toolkit. Docker is installed on all the nodes and the daemon is running; if the '''docker''' command does not work for you, ask cluster-admin to add you to the right groups. 
When submitting jobs, you can ask for GPUs in one of two ways. One is:

 #SBATCH --gres=gpu:1

That will ask for 1 GPU, generically, on any node with a free GPU. This request is more specific:

 #SBATCH --gres=gpu:A5500:3

That requests 3 A5500 GPUs '''only'''. We have several GPU types on the cluster which may fit your specific needs:

 nVidia RTX A5500 : 24GB RAM
 nVidia A100 : 80GB RAM
 nVidia GeForce RTX 2080 Ti : 11GB RAM
 nVidia GeForce GTX 1080 Ti : 11GB RAM

==Running GPU Workloads==

To actually use an nVidia GPU, you need to run a program that uses the CUDA API. There are a few ways to obtain such a program.

===Prebuilt CUDA Applications===

The Slurm cluster nodes have the nVidia drivers installed, as well as basic CUDA tools like '''nvidia-smi'''. Some projects, such as TensorFlow, ship pre-built binaries that can use CUDA. You should be able to run these binaries directly once you download them.

===Building CUDA Applications===

The cluster nodes do not have the full CUDA Toolkit. In particular, they do not have the '''nvcc''' CUDA compiler. If you want to compile applications that use CUDA, you will need to install the development environment for your own user. Once '''nvcc''' is available to your user, building CUDA applications should work. To run them, you will have to submit them as jobs, because the head node does not have a GPU.
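Putting the pieces above together (requesting a GPU with '''--gres''' and running the workload as a job), a minimal batch script might look like the following sketch. The job name and resource sizes are placeholders; adjust them to your workload.

```shell
#!/bin/bash
# Minimal GPU job sketch; job name and resource sizes are placeholders.
#SBATCH --job-name=gpu-smoke-test
#SBATCH -c 1
#SBATCH --mem=4G
#SBATCH --gres=gpu:1

# nvidia-smi is installed on the cluster nodes; fall back to a message
# so the script also exits cleanly on machines without a GPU driver.
if command -v nvidia-smi >/dev/null 2>&1; then
    report=$(nvidia-smi)
else
    report="nvidia-smi not available on this host"
fi
echo "$report"
```

Submit it with '''sbatch''' (e.g. <code>sbatch gpu-smoke-test.sh</code>); the output should list only the GPU(s) that Slurm assigned to the job.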
===Containerized GPU Workloads===

Instead of directly installing binaries, or installing and using the CUDA Toolkit, it is often easiest to use containers to download a prebuilt GPU workload along with all of its libraries and dependencies. There are a few options for running containerized GPU workloads on the cluster.

====Running Containers in Singularity====

You can run containers on the cluster using Singularity, and give them access to GPUs using the '''--nv''' option. For example:

 singularity pull docker://tensorflow/tensorflow:latest-gpu
 srun -c 8 --mem 10G --gres=gpu:1 singularity run --nv docker://tensorflow/tensorflow:latest-gpu python -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())'

This will produce output showing that the TensorFlow container is indeed able to talk to one GPU:

 INFO:    Using cached SIF image
 2023-05-15 11:36:33.110850: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
 2023-05-15 11:36:38.799035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /device:GPU:0 with 22244 MB memory:  -> device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6
 [name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 8527638019084870106
 xla_global_id: -1
 , name: "/device:GPU:0"
 device_type: "GPU"
 memory_limit: 23324655616
 locality {
   bus_id: 1
   links {
   }
 }
 incarnation: 1860154623440434360
 physical_device_desc: "device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6"
 xla_global_id: 416903419
 ]

====Running Containers in Slurm====

Slurm itself also supports a '''--container''' option for jobs, which allows a whole job to be run inside a container.
If you are able to [https://slurm.schedmd.com/containers.html convert your container to OCI Bundle format], you can pass it directly to Slurm instead of using Singularity from inside the job. However, Docker-compatible image specifiers can't be given to Slurm, only paths to OCI bundles on disk. Stand-alone tools to download a Docker image from Docker Hub in OCI bundle format ('''skopeo''' and '''umoci''') are not yet installed on the cluster, but the method using the '''docker''' command should work.

Slurm containers ''should'' have access to their assigned GPUs, but it is not clear whether tools like '''nvidia-smi''' are injected into the container, as they would be with Singularity or the nVidia Container Runtime.

====Running Containers in Docker====

You might be used to running containers with Docker, or containerized GPU workloads with the nVidia Container Runtime or Toolkit. Docker is installed on all the nodes and the daemon is running; if the '''docker''' command does not work for you, ask cluster-admin to add you to the right groups.

The '''nvidia''' runtime is set up and will be used automatically. However, while Slurm configures each job with a cgroup that directs it to the correct GPUs, '''using Docker to run another container escapes Slurm's confinement''': '''--gpus=1''' will ''always'' use the ''first'' GPU in the system, whether or not that GPU is assigned to your job. When using Docker, you ''must'' consult the '''SLURM_STEP_GPUS''' environment variable and pass it along to your container. You should also impose limits on all the other resources used by your Docker container, so that your whole job stays within the resources allocated by Slurm's scheduler. (TODO: find out how cgroups handle oversubscription between a Docker container and the Slurm container that launched it.)
An example of a working command is: srun -c 1 --mem 4G --gres=gpu:2 bash -c 'docker run --rm --gpus=\"device=$SLURM_STEP_GPUS\" nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi' Note that the double-quotes are included in the argument to '''--gpus''' as seen by the Docker client, and that '''bash''' and single-quotes are used to ensure that '''$SLURM_STEP_GPUS''' is evaluated within the job itself, and not on the head node. 18a00e086417a11057cc592ae9e8ea28a086c059 324 323 2023-05-22T15:07:48Z Anovak 4 /* Running Containers in Singularity */ wikitext text/x-wiki When submitting jobs, you can ask for GPUs in one of two ways. One is: #SBATCH --gres=gpu:1 That will ask for 1 GPU generically on a node with a free GPU. This request is more specific: #SBATCH --gres=gpu:A5500:3 That requests 3 A5500 GPUs '''only'''. We have several GPU types on the cluster which may fit your specific needs: nVidia RTX A5500 : 24GB RAM nVidia A100 : 80GB RAM nVidia GeForce RTX 2080 Ti : 11GB RAM nVidia GeForce RTX 1080 Ti : 11GB RAM ==Running GPU Workloads== To actually use an nVidia GPU, you need to run a program that uses the CUDA API. There are a few ways to obtain such a program. ===Prebuilt CUDA Applications=== The Slurm cluster nodes have the nVidia drivers installed, as well as basic CUDA tools like nvidia-smi. Some projects, such as tensorflow, may ship pre-built binaries that can use CUDA. You should be able to run these binaries directly, if you download them. ===Building CUDA Applications=== The cluster nodes do not have the full CUDA Toolkit. In particular, they do not have the '''nvcc''' CUDA-enabled compiler. If you want to compile applications that use CUDA, you will need to install the development environment yourself for your user. Once you have '''nvcc''' available to your user, building CUDA applications should work. To run them, you will have to submit them as jobs, because the head node does not have a GPU. 
===Containerized GPU Workloads=== Instead of directly installing binaries, or installing and using the CUDA Toolkit, it is often easiest to use containers to download a prebuilt GPU workload and all of its libraries and dependencies. THere are a few options for running containerized GPU workloads on the cluster. ====Running Containers in Singularity==== You can run containers on the cluster using Singularity, and give them access to GPUs using the '''--nv''' option. For example: singularity pull docker://tensorflow/tensorflow:latest-gpu srun -c 8 --mem 10G --gres=gpu:1 singularity run --nv docker://tensorflow/tensorflow:latest-gpu python -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())' This will produce output showing that the Tensorflow container is indeed able to talk to one GPU: INFO: Using cached SIF image 2023-05-15 11:36:33.110850: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. 2023-05-15 11:36:38.799035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /device:GPU:0 with 22244 MB memory: -> device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6 [name: "/device:CPU:0" device_type: "CPU" memory_limit: 268435456 locality { } incarnation: 8527638019084870106 xla_global_id: -1 , name: "/device:GPU:0" device_type: "GPU" memory_limit: 23324655616 locality { bus_id: 1 links { } } incarnation: 1860154623440434360 physical_device_desc: "device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6" xla_global_id: 416903419 ] ====Running Containers in Slurm==== Slurm itself also supports a '''--container''' option for jobs, which allows a whole job to be run inside a container. 
If you are able to [https://slurm.schedmd.com/containers.html convert your container to OCI Bundle format], you can pass it directly to Slurm instead of using Singularity from inside the job. However, Docker-compatible image specifiers can't be given to Slurm, only paths to OCI bundles on disk. Stnad-alone tools to download a Docker image from Docker Hub in OCI bundle format ('''skopeo''' and '''umoci''') are not yet installed on the cluster. But the method using the '''docker''' command should work. Slurm containers ''should'' have access to their assigned GPUs, but it is not clear if tools like '''nvidia-smi''' are injected into the container, as they would be with Singularity or the nVidia Container Runtime. ====Running Containers in Docker==== You might be used to running containers with Docker, or containerized GPU workloads with the nVidia Container Runtime or Toolkit. Docker is installed on all the nodes and the daemon is running; if the '''docker''' command does not work for you, ask cluster-admin to add you to the right groups. The '''nvidia''' runtime is set up and will automatically be used. While Slurm configures each Slurm job with a cgroup that directs it to the correct GPUs, '''using Docker to run another container escapes Slurm's confinement''', and using '''--gpus=1''' will ''always'' use the ''first'' GPU in the system, whether that GPU is assigned to your job or not. When using Docker, you ''must'' consult the '''SLRUM_STEP_GPUS''' environment variable and pass that along to your container. You should also impose limits on all other resources used by your Docker container, so that your whole job stays within the resources allocated by Slurm's scheduler. (TODO: find out how cgroups handles oversubscription between a Docker container and the Slurm container that launched it). 
An example of a working command is: srun -c 1 --mem 4G --gres=gpu:2 bash -c 'docker run --rm --gpus=\"device=$SLURM_STEP_GPUS\" nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi' Note that the double-quotes are included in the argument to '''--gpus''' as seen by the Docker client, and that '''bash''' and single-quotes are used to ensure that '''$SLURM_STEP_GPUS''' is evaluated within the job itself, and not on the head node. b672735e7e45296fdeb3900d1ee2d172d0729ee3 325 324 2023-05-22T15:11:47Z Anovak 4 /* Running Containers in Singularity */ wikitext text/x-wiki When submitting jobs, you can ask for GPUs in one of two ways. One is: #SBATCH --gres=gpu:1 That will ask for 1 GPU generically on a node with a free GPU. This request is more specific: #SBATCH --gres=gpu:A5500:3 That requests 3 A5500 GPUs '''only'''. We have several GPU types on the cluster which may fit your specific needs: nVidia RTX A5500 : 24GB RAM nVidia A100 : 80GB RAM nVidia GeForce RTX 2080 Ti : 11GB RAM nVidia GeForce RTX 1080 Ti : 11GB RAM ==Running GPU Workloads== To actually use an nVidia GPU, you need to run a program that uses the CUDA API. There are a few ways to obtain such a program. ===Prebuilt CUDA Applications=== The Slurm cluster nodes have the nVidia drivers installed, as well as basic CUDA tools like nvidia-smi. Some projects, such as tensorflow, may ship pre-built binaries that can use CUDA. You should be able to run these binaries directly, if you download them. ===Building CUDA Applications=== The cluster nodes do not have the full CUDA Toolkit. In particular, they do not have the '''nvcc''' CUDA-enabled compiler. If you want to compile applications that use CUDA, you will need to install the development environment yourself for your user. Once you have '''nvcc''' available to your user, building CUDA applications should work. To run them, you will have to submit them as jobs, because the head node does not have a GPU. 
===Containerized GPU Workloads=== Instead of directly installing binaries, or installing and using the CUDA Toolkit, it is often easiest to use containers to download a prebuilt GPU workload and all of its libraries and dependencies. THere are a few options for running containerized GPU workloads on the cluster. ====Running Containers in Singularity==== You can run containers on the cluster using Singularity, and give them access to the GPUs that Slurm has selected using the '''--nv''' option. For example: singularity pull docker://tensorflow/tensorflow:latest-gpu srun -c 8 --mem 10G --gres=gpu:1 singularity run --nv docker://tensorflow/tensorflow:latest-gpu python -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())' This will produce output showing that the Tensorflow container is indeed able to talk to one GPU: INFO: Using cached SIF image 2023-05-15 11:36:33.110850: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. 
2023-05-15 11:36:38.799035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /device:GPU:0 with 22244 MB memory: -> device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6 [name: "/device:CPU:0" device_type: "CPU" memory_limit: 268435456 locality { } incarnation: 8527638019084870106 xla_global_id: -1 , name: "/device:GPU:0" device_type: "GPU" memory_limit: 23324655616 locality { bus_id: 1 links { } } incarnation: 1860154623440434360 physical_device_desc: "device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6" xla_global_id: 416903419 ] Slurm's containment of the Slurm job to the correct set of GPUs is also passed through to the Singularity container; there is no need to specifically direct Singularity to use the right GPUs unless you are doing something unusual. ====Running Containers in Slurm==== Slurm itself also supports a '''--container''' option for jobs, which allows a whole job to be run inside a container. If you are able to [https://slurm.schedmd.com/containers.html convert your container to OCI Bundle format], you can pass it directly to Slurm instead of using Singularity from inside the job. However, Docker-compatible image specifiers can't be given to Slurm, only paths to OCI bundles on disk. Stnad-alone tools to download a Docker image from Docker Hub in OCI bundle format ('''skopeo''' and '''umoci''') are not yet installed on the cluster. But the method using the '''docker''' command should work. Slurm containers ''should'' have access to their assigned GPUs, but it is not clear if tools like '''nvidia-smi''' are injected into the container, as they would be with Singularity or the nVidia Container Runtime. ====Running Containers in Docker==== You might be used to running containers with Docker, or containerized GPU workloads with the nVidia Container Runtime or Toolkit. 
Docker is installed on all the nodes and the daemon is running; if the '''docker''' command does not work for you, ask cluster-admin to add you to the right groups. The '''nvidia''' runtime is set up and will automatically be used. While Slurm configures each Slurm job with a cgroup that directs it to the correct GPUs, '''using Docker to run another container escapes Slurm's confinement''', and using '''--gpus=1''' will ''always'' use the ''first'' GPU in the system, whether that GPU is assigned to your job or not. When using Docker, you ''must'' consult the '''SLRUM_STEP_GPUS''' environment variable and pass that along to your container. You should also impose limits on all other resources used by your Docker container, so that your whole job stays within the resources allocated by Slurm's scheduler. (TODO: find out how cgroups handles oversubscription between a Docker container and the Slurm container that launched it). An example of a working command is: srun -c 1 --mem 4G --gres=gpu:2 bash -c 'docker run --rm --gpus=\"device=$SLURM_STEP_GPUS\" nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi' Note that the double-quotes are included in the argument to '''--gpus''' as seen by the Docker client, and that '''bash''' and single-quotes are used to ensure that '''$SLURM_STEP_GPUS''' is evaluated within the job itself, and not on the head node. b6c40aace5dddc1f64b38971cc6e64b6d9b3d2c2 326 325 2023-05-22T15:19:32Z Anovak 4 wikitext text/x-wiki When submitting jobs, you can ask for GPUs in one of two ways. One is: #SBATCH --gres=gpu:1 That will ask for 1 GPU generically on a node with a free GPU. This request is more specific: #SBATCH --gres=gpu:A5500:3 That requests 3 A5500 GPUs '''only'''. 
We have several GPU types on the cluster which may fit your specific needs: nVidia RTX A5500 : 24GB RAM nVidia A100 : 80GB RAM nVidia GeForce RTX 2080 Ti : 11GB RAM nVidia GeForce RTX 1080 Ti : 11GB RAM For the most part, Slurm takes care of making sure that each job only sees and used the GPUs assigned to it. Within the job, '''CUDA_VISIBLE_DEVICES''' will be set in the environment, but it will always be set to a list of your requested number of GPUs, starting at 0. Slurm re-numbers the GPUs assigned to each job to appear to start at 0, within the job. If you need access to the "real" GPU numbers (to log or to pass along to Docker), they are available in the '''SLURM_STEP_GPUS''' environment variable. ==Running GPU Workloads== To actually use an nVidia GPU, you need to run a program that uses the CUDA API. There are a few ways to obtain such a program. ===Prebuilt CUDA Applications=== The Slurm cluster nodes have the nVidia drivers installed, as well as basic CUDA tools like nvidia-smi. Some projects, such as tensorflow, may ship pre-built binaries that can use CUDA. You should be able to run these binaries directly, if you download them. ===Building CUDA Applications=== The cluster nodes do not have the full CUDA Toolkit. In particular, they do not have the '''nvcc''' CUDA-enabled compiler. If you want to compile applications that use CUDA, you will need to install the development environment yourself for your user. Once you have '''nvcc''' available to your user, building CUDA applications should work. To run them, you will have to submit them as jobs, because the head node does not have a GPU. ===Containerized GPU Workloads=== Instead of directly installing binaries, or installing and using the CUDA Toolkit, it is often easiest to use containers to download a prebuilt GPU workload and all of its libraries and dependencies. THere are a few options for running containerized GPU workloads on the cluster. 
====Running Containers in Singularity==== You can run containers on the cluster using Singularity, and give them access to the GPUs that Slurm has selected using the '''--nv''' option. For example: singularity pull docker://tensorflow/tensorflow:latest-gpu srun -c 8 --mem 10G --gres=gpu:1 singularity run --nv docker://tensorflow/tensorflow:latest-gpu python -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())' This will produce output showing that the Tensorflow container is indeed able to talk to one GPU: INFO: Using cached SIF image 2023-05-15 11:36:33.110850: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. 2023-05-15 11:36:38.799035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /device:GPU:0 with 22244 MB memory: -> device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6 [name: "/device:CPU:0" device_type: "CPU" memory_limit: 268435456 locality { } incarnation: 8527638019084870106 xla_global_id: -1 , name: "/device:GPU:0" device_type: "GPU" memory_limit: 23324655616 locality { bus_id: 1 links { } } incarnation: 1860154623440434360 physical_device_desc: "device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6" xla_global_id: 416903419 ] Slurm's containment of the Slurm job to the correct set of GPUs is also passed through to the Singularity container; there is no need to specifically direct Singularity to use the right GPUs unless you are doing something unusual. ====Running Containers in Slurm==== Slurm itself also supports a '''--container''' option for jobs, which allows a whole job to be run inside a container. 
If you are able to [https://slurm.schedmd.com/containers.html convert your container to OCI Bundle format], you can pass it directly to Slurm instead of using Singularity from inside the job. However, Docker-compatible image specifiers can't be given to Slurm, only paths to OCI bundles on disk. Stnad-alone tools to download a Docker image from Docker Hub in OCI bundle format ('''skopeo''' and '''umoci''') are not yet installed on the cluster. But the method using the '''docker''' command should work. Slurm containers ''should'' have access to their assigned GPUs, but it is not clear if tools like '''nvidia-smi''' are injected into the container, as they would be with Singularity or the nVidia Container Runtime. ====Running Containers in Docker==== You might be used to running containers with Docker, or containerized GPU workloads with the nVidia Container Runtime or Toolkit. Docker is installed on all the nodes and the daemon is running; if the '''docker''' command does not work for you, ask cluster-admin to add you to the right groups. The '''nvidia''' runtime is set up and will automatically be used. While Slurm configures each Slurm job with a cgroup that directs it to the correct GPUs, '''using Docker to run another container escapes Slurm's confinement''', and using '''--gpus=1''' will ''always'' use the ''first'' GPU in the system, whether that GPU is assigned to your job or not. When using Docker, you ''must'' consult the '''SLRUM_STEP_GPUS''' environment variable and pass that along to your container. You should also impose limits on all other resources used by your Docker container, so that your whole job stays within the resources allocated by Slurm's scheduler. (TODO: find out how cgroups handles oversubscription between a Docker container and the Slurm container that launched it). 
An example of a working command is: srun -c 1 --mem 4G --gres=gpu:2 bash -c 'docker run --rm --gpus=\"device=$SLURM_STEP_GPUS\" nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi' Note that the double-quotes are included in the argument to '''--gpus''' as seen by the Docker client, and that '''bash''' and single-quotes are used to ensure that '''$SLURM_STEP_GPUS''' is evaluated within the job itself, and not on the head node. eb5bb2a3585f4bad0338c3acbe4867f99aff485f 327 326 2023-05-22T19:32:57Z Anovak 4 wikitext text/x-wiki When submitting jobs, you can ask for GPUs in one of two ways. One is: #SBATCH --gres=gpu:1 That will ask for 1 GPU generically on a node with a free GPU. This request is more specific: #SBATCH --gres=gpu:A5500:3 That requests 3 A5500 GPUs '''only'''. We have several GPU types on the cluster which may fit your specific needs: nVidia RTX A5500 : 24GB RAM nVidia A100 : 80GB RAM nVidia GeForce RTX 2080 Ti : 11GB RAM nVidia GeForce RTX 1080 Ti : 11GB RAM For the most part, Slurm takes care of making sure that each job only sees and used the GPUs assigned to it. Within the job, '''CUDA_VISIBLE_DEVICES''' will be set in the environment, but it will always be set to a list of your requested number of GPUs, starting at 0. Slurm re-numbers the GPUs assigned to each job to appear to start at 0, within the job. If you need access to the "real" GPU numbers (to log or to pass along to Docker), they are available in the '''SLURM_JOB_GPUS''' (for '''sbatch''') or '''SLURM_STEP_GPUS''' (for '''srun''') environment variable. ==Running GPU Workloads== To actually use an nVidia GPU, you need to run a program that uses the CUDA API. There are a few ways to obtain such a program. ===Prebuilt CUDA Applications=== The Slurm cluster nodes have the nVidia drivers installed, as well as basic CUDA tools like nvidia-smi. Some projects, such as tensorflow, may ship pre-built binaries that can use CUDA. You should be able to run these binaries directly, if you download them. 
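The GPU re-numbering described above can be sketched in plain shell. The variable values here are hypothetical, standing in for what a job that requested 2 GPUs and was assigned physical devices 2 and 3 might observe; they are not output captured from the cluster:

```shell
# Hypothetical values for illustration only: inside a 2-GPU srun job that
# was assigned physical devices 2 and 3 on its node, Slurm would export
# something like the following.
SLURM_STEP_GPUS="2,3"        # the "real" device indices (srun; sbatch sets SLURM_JOB_GPUS instead)
CUDA_VISIBLE_DEVICES="0,1"   # the renumbered view that CUDA code sees inside the job

# Logging both views, e.g. at the top of a job script, can help when
# debugging multi-GPU runs or when handing devices to Docker later.
echo "CUDA view:     $CUDA_VISIBLE_DEVICES"
echo "Physical GPUs: $SLURM_STEP_GPUS"
```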
===Building CUDA Applications===

The cluster nodes do not have the full CUDA Toolkit. In particular, they do not have the '''nvcc''' CUDA-enabled compiler. If you want to compile applications that use CUDA, you will need to install the development environment yourself, for your own user. Once you have '''nvcc''' available to your user, building CUDA applications should work. To run them, you will have to submit them as jobs, because the head node does not have a GPU.

===Containerized GPU Workloads===

Instead of directly installing binaries, or installing and using the CUDA Toolkit, it is often easiest to use containers to download a prebuilt GPU workload along with all of its libraries and dependencies. There are a few options for running containerized GPU workloads on the cluster.

====Running Containers in Singularity====

You can run containers on the cluster using Singularity, and give them access to the GPUs that Slurm has selected using the '''--nv''' option. For example:

 singularity pull docker://tensorflow/tensorflow:latest-gpu
 srun -c 8 --mem 10G --gres=gpu:1 singularity run --nv docker://tensorflow/tensorflow:latest-gpu python -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())'

This will produce output showing that the Tensorflow container is indeed able to talk to one GPU:

 INFO:    Using cached SIF image
 2023-05-15 11:36:33.110850: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
 2023-05-15 11:36:38.799035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /device:GPU:0 with 22244 MB memory: -> device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6
 [name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 8527638019084870106
 xla_global_id: -1
 , name: "/device:GPU:0"
 device_type: "GPU"
 memory_limit: 23324655616
 locality {
   bus_id: 1
   links {
   }
 }
 incarnation: 1860154623440434360
 physical_device_desc: "device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6"
 xla_global_id: 416903419
 ]

Slurm's containment of the Slurm job to the correct set of GPUs is also passed through to the Singularity container; there is no need to specifically direct Singularity to use the right GPUs unless you are doing something unusual.

====Running Containers in Slurm====

Slurm itself also supports a '''--container''' option for jobs, which allows a whole job to be run inside a container.

If you are able to [https://slurm.schedmd.com/containers.html convert your container to OCI Bundle format], you can pass it directly to Slurm instead of using Singularity from inside the job. However, Docker-compatible image specifiers can't be given to Slurm, only paths to OCI bundles on disk. Stand-alone tools to download a Docker image from Docker Hub in OCI bundle format ('''skopeo''' and '''umoci''') are not yet installed on the cluster, but the method using the '''docker''' command should work.

Slurm containers ''should'' have access to their assigned GPUs, but it is not clear whether tools like '''nvidia-smi''' are injected into the container, as they would be with Singularity or the nVidia Container Runtime.

====Running Containers in Docker====

You might be used to running containers with Docker, or containerized GPU workloads with the nVidia Container Runtime or Toolkit.
Docker is installed on all the nodes and the daemon is running; if the '''docker''' command does not work for you, ask cluster-admin to add you to the right groups. The '''nvidia''' runtime is set up and will automatically be used.

While Slurm configures each Slurm job with a cgroup that directs it to the correct GPUs, '''using Docker to run another container escapes Slurm's confinement''', and using '''--gpus=1''' will ''always'' use the ''first'' GPU in the system, whether that GPU is assigned to your job or not. When using Docker, you ''must'' consult the '''SLURM_JOB_GPUS''' (for '''sbatch''') or '''SLURM_STEP_GPUS''' (for '''srun''') environment variable and pass that along to your container. You should also impose limits on all other resources used by your Docker container, so that your whole job stays within the resources allocated by Slurm's scheduler. (TODO: find out how cgroups handles oversubscription between a Docker container and the Slurm container that launched it.)

An example of a working command is:

 srun -c 1 --mem 4G --gres=gpu:2 bash -c 'docker run --rm --gpus=\"device=$SLURM_STEP_GPUS\" nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi'

Note that the double-quotes are included in the argument to '''--gpus''' as seen by the Docker client, and that '''bash''' and single-quotes are used to ensure that '''$SLURM_STEP_GPUS''' is evaluated within the job itself, and not on the head node.

Quick Reference Guide

== General Commands ==

Get documentation on a command:

 man <command>

Try the following commands:

 man sbatch
 man squeue
 man scancel
 man srun

== Submitting jobs ==

The following example script specifies a partition, time limit, memory allocation, and number of cores. All your scripts should specify values for these four parameters. You can also set additional parameters as shown, such as job name and output file.
This script performs a simple task: it generates a file of random numbers and then sorts it. A detailed explanation of the script is available here.

 #!/bin/bash
 #
 #SBATCH -p shared            # partition (queue)
 #SBATCH -c 1                 # number of cores
 #SBATCH --mem 100            # memory pool for all cores
 #SBATCH -t 0-2:00            # time (D-HH:MM)
 #SBATCH -o slurm.%N.%j.out   # STDOUT
 #SBATCH -e slurm.%N.%j.err   # STDERR
 
 for i in {1..100000}; do
     echo $RANDOM >> SomeRandomNumbers.txt
 done
 sort SomeRandomNumbers.txt

Now you can submit your job with the command:

 sbatch myscript.sh

If you want to test your job and find out when it is estimated to run, use the following (note that this does not actually submit the job):

 sbatch --test-only myscript.sh

== Information on Jobs ==

List all current jobs for a user:

 squeue -u <username>

List all running jobs for a user:

 squeue -u <username> -t RUNNING

List all pending jobs for a user:

 squeue -u <username> -t PENDING

List all current jobs in the shared partition for a user:

 squeue -u <username> -p shared

List detailed information for a job (useful for troubleshooting):

 scontrol show jobid -dd <jobid>

List status info for a currently running job:

 sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps

Once your job has completed, you can get additional information that was not available during the run. This includes run time, memory used, etc.
To get statistics on completed jobs by job ID:

 sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed

To view the same information for all jobs of a user:

 sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed

== Controlling jobs ==

To cancel one job:

 scancel <jobid>

To cancel all the jobs for a user:

 scancel -u <username>

To cancel all the pending jobs for a user:

 scancel -t PENDING -u <username>

To cancel one or more jobs by name:

 scancel --name myJobName

To hold a particular job from being scheduled:

 scontrol hold <jobid>

To release a particular job to be scheduled:

 scontrol release <jobid>

To requeue (cancel and rerun) a particular job:

 scontrol requeue <jobid>

== Job arrays and useful commands ==

As shown in the commands above, it's easy to refer to one job by its job ID, or to all your jobs via your username. What if you want to refer to a subset of your jobs? The answer is to submit your job set as a job array. Then you can use the job array ID to refer to the set when running SLURM commands.

== SLURM job arrays ==

To cancel an indexed job in a job array:

 scancel <jobid>_<index>

e.g.

 scancel 1234_4

To find the original submit time for your job array:

 sacct -j 32532756 -o submit -X --noheader | uniq

== Advanced (but useful!) Commands ==

The following commands work for individual jobs and for job arrays, and allow easy manipulation of large numbers of jobs. You can combine these commands with the parameters shown above to provide great flexibility and precision in job control.
(Note that all of these commands are entered on one line.)

Suspend all running jobs for a user (takes into account job arrays):

 squeue -ho %A -t R | xargs -n 1 scontrol suspend

Resume all suspended jobs for a user:

 squeue -o "%.18A %.18t" -u <username> | awk '{if ($2 =="S"){print $1}}' | xargs -n 1 scontrol resume

After resuming, check if any are still suspended:

 squeue -ho %A -u $USER -t S | wc -l

View Cluster State:

 shost

Genomics Institute Computing Information
Welcome to the Genomic Institute Computing Information Repository! Browse the below topics for help in the area you are curious about.
==GI Public Computing Environment== *[[How to access the public servers]] ==GI Firewalled Computing Environment (PRISM)== *[[Access to the Firewalled Compute Servers]] *[[Computing Resources Overview]] *[[Firewalled Environment Storage Overview]] ==VPN Access== *[[Requirement for users to get GI VPN access]] == NIH dbGaP Access Requirements == *[[Requirements for dbGaP Access]] == giCloud Openstack == *[[Overview of giCloud in the Genomics Institute]] *[[Quick Start Instructions to Get Rolling with OpenStack]] == Amazon Web Services Information == *[[Overview of Getting and Using an AWS IAM Account]] *[[AWS Account List and Numbers]] *[[AWS Shared Bucket Usage Graphs]] *[[AWS Best Practices]] *[[AWS S3 Lifecycle Management]] == Slurm at the Genomics Institute == *[[Overview of using Slurm]] *[[Annotated Slurm Script]] *[[Job Arrays]] *[[GPU Resources]] *[[Quick Reference Guide]] *[[Slurm Tips for vg]] *[[Slurm Tips for Toil]] ==Kubernetes Information== *[[Computational Genomics Kubernetes Installation]] *[[Undiagnosed Disease Project Kubernetes Installation]] == Problems or technical support == If you have any problems with the GI computing environment, please send an email to '''cluster-admin@soe.ucsc.edu''' 49d2750023c4c01c5009891997e3626abfb91f5c Slurm Tips for vg 0 37 317 2023-05-15T21:33:26Z Anovak 4 Created page with "This page explains how to set up a development environment for [https://github.com/vgteam/vg vg] on the Phoenix cluster. 1. Make yourself a user directory under '''/private/g..." wikitext text/x-wiki This page explains how to set up a development environment for [https://github.com/vgteam/vg vg] on the Phoenix cluster. 1. Make yourself a user directory under '''/private/groups''', which is where large data must be stored. For example, if you are in the Paten lab: mkdir /private/groups/patenlab/$USER 2. (Optional) Link it over to your home directory, so it is easy to use storage there to store your repos. 
The '''/private/groups''' storage may be faster than the home directory storage. mkdir -p /private/groups/patenlab/$USER/workspace ln -s /private/groups/patenlab/$USER/workspace ~/workspace 3. Make sure you have SSH keys created and add them to Github. cat ~/.ssh/id_ed25519.pub || (ssh-keygen -t ed25519 && cat ~/.ssh/id_ed25519.pub) # Paste into https://github.com/settings/ssh/new 4. Make a place to put your clone, and clone vg: mkdir -p ~/workspace cd ~/workspace git clone --recursive git@github.com:vgteam/vg.git cd vg 5. vg's dependencies should already be installed on the cluster nodes. If any of them seem to be missing, tell cluster-admin@soe.ucsc.edu to install them. 6. Build vg as a Slurm job. This will send the build out to the cluster as a 64-core, 80G memory job, and keep the output logs in your terminal. srun -c 64 --mem=80G make -j64 This will leave your vg binary at '''~/workspace/vg/bin/vg'''. a68363181a35394a19b196e1098ea28556e594ca 319 317 2023-05-15T21:40:42Z Anovak 4 wikitext text/x-wiki This page explains how to set up a development environment for [https://github.com/vgteam/vg vg] on the Phoenix cluster. ==Setting Up== 1. After connecting to the VPN, connect to the cluster head node: ssh phoenix.prism This node is relatively small, so you shouldn't run real work on it, but it is the place you need to be to submit Slurm jobs. 2. Make yourself a user directory under '''/private/groups''', which is where large data must be stored. For example, if you are in the Paten lab: mkdir /private/groups/patenlab/$USER 3. (Optional) Link it over to your home directory, so it is easy to use storage there to store your repos. The '''/private/groups''' storage may be faster than the home directory storage. mkdir -p /private/groups/patenlab/$USER/workspace ln -s /private/groups/patenlab/$USER/workspace ~/workspace 4. Make sure you have SSH keys created and add them to Github.
cat ~/.ssh/id_ed25519.pub || (ssh-keygen -t ed25519 && cat ~/.ssh/id_ed25519.pub) # Paste into https://github.com/settings/ssh/new 5. Make a place to put your clone, and clone vg: mkdir -p ~/workspace cd ~/workspace git clone --recursive git@github.com:vgteam/vg.git cd vg 6. vg's dependencies should already be installed on the cluster nodes. If any of them seem to be missing, tell cluster-admin@soe.ucsc.edu to install them. 7. Build vg as a Slurm job. This will send the build out to the cluster as a 64-core, 80G memory job, and keep the output logs in your terminal. srun -c 64 --mem=80G make -j64 This will leave your vg binary at '''~/workspace/vg/bin/vg'''. ==Misc Tips== * If you want an interactive session with appreciable resources, you can schedule one with '''srun'''. For example, to get 16 cores and 120G memory all for you, run: srun -c 16 --mem 120G --pty bash -i ca7518d2d41e3817d70e8f2c6011fbda8d036290 320 319 2023-05-15T21:43:24Z Anovak 4 /* Misc Tips */ wikitext text/x-wiki This page explains how to set up a development environment for [https://github.com/vgteam/vg vg] on the Phoenix cluster. ==Setting Up== 1. After connecting to the VPN, connect to the cluster head node: ssh phoenix.prism This node is relatively small, so you shouldn't run real work on it, but it is the place you need to be to submit Slurm jobs. 2. Make yourself a user directory under '''/private/groups''', which is where large data must be stored. For example, if you are in the Paten lab: mkdir /private/groups/patenlab/$USER 3. (Optional) Link it over to your home directory, so it is easy to use storage there to store your repos. The '''/private/groups''' storage may be faster than the home directory storage. mkdir -p /private/groups/patenlab/$USER/workspace ln -s /private/groups/patenlab/$USER/workspace ~/workspace 4. Make sure you have SSH keys created and add them to Github.
cat ~/.ssh/id_ed25519.pub || (ssh-keygen -t ed25519 && cat ~/.ssh/id_ed25519.pub) # Paste into https://github.com/settings/ssh/new 5. Make a place to put your clone, and clone vg: mkdir -p ~/workspace cd ~/workspace git clone --recursive git@github.com:vgteam/vg.git cd vg 6. vg's dependencies should already be installed on the cluster nodes. If any of them seem to be missing, tell cluster-admin@soe.ucsc.edu to install them. 7. Build vg as a Slurm job. This will send the build out to the cluster as a 64-core, 80G memory job, and keep the output logs in your terminal. srun -c 64 --mem=80G make -j64 This will leave your vg binary at '''~/workspace/vg/bin/vg'''. ==Misc Tips== * If you want an interactive session with appreciable resources, you can schedule one with '''srun'''. For example, to get 16 cores and 120G memory all for you, run: srun -c 16 --mem 120G --pty bash -i * To send out a job without making a script file for it, use '''sbatch --wrap "your command here"'''. * Arguments from '''SBATCH''' lines in a script can also be given on the '''sbatch''' command line, where they override the values in the script. * You can use [https://github.com/CLIP-HPC/SlurmCommander#readme Slurm Commander] to watch the state of the cluster with the '''scom''' command. 06be2d97ad3c4c8ec953960fd47432e22d1c3941 Slurm Tips for Toil 0 38 318 2023-05-15T21:34:40Z Anovak 4 Created page with "Here are some tips for running Toil workflows on the Phoenix Slurm cluster. Mostly you might want to run WDL workflows, but you can use some of these for other workflows like..." wikitext text/x-wiki Here are some tips for running Toil workflows on the Phoenix Slurm cluster. Mostly you might want to run WDL workflows, but you can use some of these for other workflows like Cactus. You can also consult [https://github.com/DataBiosphere/toil/blob/master/docs/running/wdl.rst#running-wdl-with-toil the Toil documentation on WDL workflows].
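The '''srun''', '''sbatch --wrap''', and '''SBATCH'''-line tips from the vg section above can be combined into a small batch-script sketch. The job name, resource sizes, and placeholder command here are illustrative assumptions, not cluster policy:

```shell
# Write a minimal Slurm batch script. Resource requests mirror the
# interactive srun example above; the job name and command are made up.
cat > vg-build.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=vg-build
#SBATCH -c 16
#SBATCH --mem=120G
srun hostname   # placeholder for real work, e.g. make -j16
EOF

# Submit it, optionally overriding an SBATCH line from the command line:
#   sbatch -c 32 vg-build.sh
# Or skip the script file entirely:
#   sbatch -c 16 --mem=120G --wrap "make -j16"
```

Command-line options to '''sbatch''' win over the '''#SBATCH''' directives in the script, so a script like this works as a reusable default.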
* Because the new WDL interpreter in Toil isn't yet in any release, you would want to install Toil from source with WDL support with: pip3 install git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl] * You will then need to make sure your '''~/.local/bin''' directory is on your PATH. Open up your '''~/.bashrc''' file and add: export PATH=$PATH:$HOME/.local/bin Then make sure to log out and back in again. * For Toil options, you will want '''--batchSystem slurm''' to make it use Slurm and '''--batchLogsDir ./logs''' (or some other location on a shared filesystem) for the Slurm logs to not get lost. * If using '''toil-wdl-runner''', you might want to add '''--jobStore ./jobStore''' to make sure the job store is in a defined, shared location so that you can use '''--restart''' later. * If using '''toil-wdl-runner''', you will want to set the '''SINGULARITY_CACHEDIR''' and '''MINIWDL__SINGULARITY__IMAGE_CACHE''' environment variables for your workflow to locations on shared storage, and possibly to the default cache locations in your home directory. Otherwise Toil will set them to node-local directories for each node, and thus re-download images for each workflow run, and for each cluster node. To avoid this, before your run or in your '''~/.bashrc''', you could, for example, run: export SINGULARITY_CACHEDIR=$HOME/.singularity/cache export MINIWDL__SINGULARITY__IMAGE_CACHE=$HOME/.cache/miniwdl f88f3272699a4c4fc15fd09af40679cf761e1103 329 318 2023-06-05T21:29:47Z Anovak 4 wikitext text/x-wiki Here are some tips for running Toil workflows on the Phoenix Slurm cluster. Mostly you might want to run WDL workflows, but you can use some of these for other workflows like Cactus. You can also consult [https://github.com/DataBiosphere/toil/blob/master/docs/running/wdl.rst#running-wdl-with-toil the Toil documentation on WDL workflows].
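Put together, the tips above amount to an invocation along these lines. '''workflow.wdl''' and '''inputs.json''' are placeholder names, and only the flags discussed on this page are shown; the command is assembled into a variable here purely so the sketch can be shown without a cluster:

```shell
# Shared caches so Singularity images aren't re-pulled on every node (see above).
export SINGULARITY_CACHEDIR=$HOME/.singularity/cache
export MINIWDL__SINGULARITY__IMAGE_CACHE=$HOME/.cache/miniwdl
mkdir -p logs

# Hypothetical invocation; workflow.wdl and inputs.json are placeholders.
cmd="toil-wdl-runner workflow.wdl inputs.json \
  --batchSystem slurm --batchLogsDir ./logs --jobStore ./jobStore"
echo "$cmd"   # re-run this same command with --restart to resume a failed run
```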
* Because the new WDL interpreter in Toil isn't yet in any release, you would want to install Toil from source with WDL support with: pip3 install git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl] * You will then need to make sure your '''~/.local/bin''' directory is on your PATH. Open up your '''~/.bashrc''' file and add: export PATH=$PATH:$HOME/.local/bin Then make sure to log out and back in again. * For Toil options, you will want '''--batchSystem slurm''' to make it use Slurm and '''--batchLogsDir ./logs''' (or some other location on a shared filesystem) for the Slurm logs to not get lost. * You may be able to speed up your workflow with '''--caching true''', to cache data on nodes to be shared among multiple simultaneous tasks. * If using '''toil-wdl-runner''', you might want to add '''--jobStore ./jobStore''' to make sure the job store is in a defined, shared location so that you can use '''--restart''' later. * If using '''toil-wdl-runner''', you will want to set the '''SINGULARITY_CACHEDIR''' and '''MINIWDL__SINGULARITY__IMAGE_CACHE''' environment variables for your workflow to locations on shared storage, and possibly to the default cache locations in your home directory. Otherwise Toil will set them to node-local directories for each node, and thus re-download images for each workflow run, and for each cluster node. 
To avoid this, before your run or in your '''~/.bashrc''', you could, for example, run: export SINGULARITY_CACHEDIR=$HOME/.singularity/cache export MINIWDL__SINGULARITY__IMAGE_CACHE=$HOME/.cache/miniwdl 3d0667981261dfffc315c727cb0a3d6d02d97c9f AWS Account List and Numbers 0 22 328 266 2023-05-27T19:58:41Z Weiler 3 wikitext text/x-wiki This is a list of our currently available AWS accounts and their account numbers: ucsc-bd2k : 862902209576 ucsc-toil-dev : 318423852362 ucsc-vg-dev : 781907127277 ucsc-platform-dev : 719818754276 comparative-genomics-dev : 162786355865 nanopore-dev : 270442831226 ucsc-cgp-production : 097093801910 platform-hca-dev : 122796619775 anvil-dev : 608666466534 gi-gateway : 652235167018 pangenomics : 422448306679 braingeneers : 443872533066 ucsctreehouse : 238605363322 ucsc-bisti-dev : 851631505710 ucsc-genome-browser : 784962239183 dockstore-dev : 635220370222 ucsc-spatial : 541180793903 platform-hca-prod : 542754589326 miga-lab : 156518225147 platform-anvil-dev : 289950828509 platform-anvil-prod : 465330168186 platform-anvil-portal : 166384485414 agc-runs : 598929688444 sequencing-center-cold-store : 436140841220 c2811e5d92919ef08d25644f74409ccd11e95920 Quick Start Instructions to Get Rolling with OpenStack 0 26 330 215 2023-06-06T15:06:27Z Anovak 4 /* Upload your SSH Public Key */ wikitext text/x-wiki __TOC__ ==Request an OpenStack Account== Once you have [http://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access PRISM/GI VPN access], you can request an OpenStack account. You will need to send an email to cluster-admin@soe.ucsc.edu asking for access, and let us know which lab you are in, or who your PI is, so we can place you in the right OpenStack group. ==Create a SSH Public/Private Keypair== To log into an OpenStack VM instance, you will need a SSH public key. The key is "injected" into the instance upon creation, and only someone with that key (i.e. you) will be able to log in via SSH initially.
If you already have a SSH public and private key that you use elsewhere, you can use that one, and can skip to the next step. If you don't have a SSH keypair set up yet, then you will need to log into the UNIX compatible machine you will be '''logging in from''' (a Mac/Apple computer will also work), and then run the 'ssh-keygen' command. If you are behind the VPN, you can first log into mustard, crimson or razzmatazz, which are linux servers. The command will look something like this: $ ssh-keygen -t rsa Generating public/private rsa key pair. Enter file in which to save the key (/public/home/frank/.ssh/id_rsa): Created directory '/public/home/frank/.ssh'. Enter passphrase (empty for no passphrase): [JUST HIT ENTER] Enter same passphrase again: [JUST HIT ENTER] Your identification has been saved in /public/home/frank/.ssh/id_rsa. Your public key has been saved in /public/home/frank/.ssh/id_rsa.pub. The key fingerprint is: SHA256:dhJG1A3gcwj7Mz17ommt3NIczMVVgrzp8Tf6F1X4jpI The key's randomart image is: +---[RSA 2048]----+ | ..+o.+ ..o.| | = .. + o..| | . * .. + ..| | o = * o| | So+o + o.| | . =+oE ooo| | +o.....o| | .o++o . .| | .=o. ...| +----[SHA256]-----+ You will then have a new directory, "~/.ssh", and inside that directory you will have a file called "id_rsa.pub". That is your SSH public key. You will need this in the next step in order to set up your key in OpenStack. ==Log In To giCloud== Once you have been notified that your account has been set up and have been given login credentials, connect to the VPN and then go to this link in your favorite web browser, which is the login page: http://gicloud.prism To login, enter your username and password. Also you will see a "Domain" field, just enter the word "default" for the domain. Click "Log In". You will be logged into your group's summary page. ==Upload your SSH Public Key== After creating your new key in the above "Create a SSH Public/Private Keypair" step, you will need to upload that key into OpenStack. 
Once you are logged in, on the left hand navigation menu, click "Project", then in the submenu, select "Compute", and finally select "Key Pairs". It should take you to the "Key Pairs" window as shown here. [[File:Keypairs.png|900px]] Next click the "Import Public Key" button on the top right of the window. In the resulting window, name your key in the "Key Pair Name" field. Name it something descriptive like "laptop-key" if the key is on your laptop, or "mustard-key" if you are logged into mustard, etc. '''Your key must be an RSA key!''' The newer ED25519 keys '''do not work''' with our version of OpenStack. To get your key, open a terminal window and type "cat ~/.ssh/id_rsa.pub" to get your full key, as so: $ cat ~/.ssh/id_rsa.pub ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAyVKNfdBbDIk7Iq8JmL+u3vxAn4M1iaQgMU5tHJhMSAYBZEZRLZAFc+Qovxe5zzs1ixte9lCipLep39q2I4U8XND17nYliZ4HVM4MW4GsMUfKsgX2FI3mB2vAQ9pZSLkAhTg2D+92uALUSSv1cDZhTqo7DuPRX2Upxyd5QbRL6TRFswBjHz2vY/JpaPQm1S1d10mokPpmxehLfwp0mVgmz1Uv/6FflqiZ68DhDN67cs1yQgWYXQ01IHPjzTKRwCuZVkgT99rkoqy6TkAyrvsfzYPZbIA2y+ovOBzq6WCUT9gp5Jx/UE6CxLSmAuGPAJkV5D/twKIe75xc+5jdi3I1cgKw== user@laptop Copy that whole line, starting with "ssh-rsa" all the way through the very last character, including the "user@laptop" bit (which may be different for you, just be sure to include it in the line copy). Then back in the OpenStack Key Pair dialogue window, paste in the keypair in the "Public Key" window, then click "Import Key". The key should then appear in the key list. ==Launch a New Instance== We are now ready to launch our new VM instance. On the left navigation menu, select "Project", then in the submenu, select "Compute", and finally select "Instances". You will see any currently running instances in your group in the resulting screen. Next you need to click the "Launch Instance" button on the top right. You will be put into the "Details" tab in the instance creation dialogue. 
You need to choose an instance name and enter it into the "Instance Name" field. It should include your username as a prefix so that others know who owns each instance. Something like "frank-newtest1" would work well. You can ignore the "Description" field, "Availability Zone" should be "nova" and "Count" should be "1". Next click the "Source" tab on the left. In the "Source" menu, in the "Select Boot Source" field, select "Image" and next to it select "No" for "Create New Volume". Then in the below list of images, choose your image and click the little "Up Arrow" icon to the right of the image you want, to add it. Next click the "Flavor" tab on the left. In that menu, choose how much CPU, RAM and disk space you want for your new VM. Some images have minimum requirements, and as such some of the smaller flavors may not be available. Select your flavor by clicking the little "Up Arrow" icon on the right of your flavor. Next click the "Key Pair" tab on the left. Click the little "Up Arrow" to the right of the Key Pair you created in the previous step, where you created a Key Pair. Ignore the rest of the options on the left, you have configured all you need to launch the instance. Click the blue "Launch Instance" button on the bottom right of your window, as seen below: [[File:Launch.png|850px]] You will be taken back to the Instances Summary page and you should see your new instance launching. After a bit your instance will change from "Spawning" to "Running". This means the instance is now booting, and should finish booting in a minute or two. In the meantime we will need to attach a "Floating IP" address to your instance such that you can SSH into the instance. On the right side of your running instance, you should see a drop-down menu, usually the "Create Snapshot" option is pre-selected. Click the drop down menu arrow to open that menu, and select "Associate Floating IP".
In the "Associate Floating IP" dialogue, click the drop down menu to see if any IP addresses are already available, and if so, go ahead and select one. If there are none available, click the little "+" button to the right to allocate a floating IP address. It will ask you what Pool to use, select "ext-net". You can put in a description if you want but most folks leave that field blank. Then click "Allocate IP". It will take you back one menu level. It will have a field "Port to be Associated", just leave that alone with the default that is already there. Click the blue "Associate" button on the bottom right of the window. You will be returned to the "Instances Summary" page again. You will see your instance running, and it should now list a "Floating IP" that it is running under. That is the IP that you will use to SSH to the instance. ==Connect to Your New Instance== Now that your instance is up and running, let's SSH to it and get going! '''From the computer you created your SSH keys on,''' SSH to your instance using the username matching the OS type you chose (ubuntu, centos, etc), and the Floating IP address your instance has. '''You must be connected to the VPN for this to work!''' Example: $ ssh ubuntu@10.50.100.67 If you launched a CentOS instance, it would instead be "ssh centos@10.50.100.67", as appropriate. Assuming everything went as planned, you will be logged into your new Linux instance as the "ubuntu" or "centos" user, which is an unprivileged user. If you get a "Connection Refused" error when trying to SSH in, it means your instance isn't quite through launching yet, try again in about 30 seconds. You do, however, have full sudo rights to do whatever administration you need to do. At this point it is assumed you have some systems administration skills under your belt, or at least have some time to query Google as to how to perform various Linux tasks as necessary.
Your instance has full Internet access to the Greater Internet, so you can download things from the Internet, run "apt-get install" or "yum update" or whatever is appropriate. You can then install any software you need to get your work done. '''NOTE:''' You are the Systems Administrator of your instance - we cannot support questions on how to administer Linux for you. If OpenStack itself is having issues then please let us know, but please defer questions like "How do I install software on Ubuntu" to Google searches. ==Storage on Your New Instance== Most of your storage on your new instance will be located in the /mnt directory, as seen by a "df -h" command on your instance: ubuntu@erich1:~$ df -h Filesystem Size Used Avail Use% Mounted on udev 16G 0 16G 0% /dev tmpfs 3.2G 676K 3.2G 1% /run /dev/vda1 20G 975M 19G 5% / tmpfs 16G 0 16G 0% /dev/shm tmpfs 5.0M 0 5.0M 0% /run/lock tmpfs 16G 0 16G 0% /sys/fs/cgroup /dev/vda15 105M 3.4M 102M 4% /boot/efi /dev/vdb1 1.0T 1.1G 1023G 1% /mnt tmpfs 3.2G 0 3.2G 0% /run/user/1000 Notice that "/mnt" has 1TB of disk space, so store all your big important data in /mnt. Avoid storing data on "/" whenever possible to prevent issues with the root filesystem filling up. The exact amount of storage available will depend on what flavor you chose when creating the instance. ==Instance Control Options== Just a few notes on controlling your instances. They are fully functioning Linux machines, so a "sudo reboot" will reboot the machine, "sudo poweroff" will shut it down, etc. In cloud parlance, "Shut Down" means the instance is still there but the power is off. "Terminated" means it's fully deleted and is unrecoverable, so be sure you want to delete your instance before you do so. We do not back instances up. We also have no access to your instance so we cannot log in and see what's going on. You can control your instance in several ways from the OpenStack web interface, in the Instance Summary page.
On the right side of your instance in the list will be that little drop down menu. Options of interest are: '''1: Create Snapshot''' Never use this option as we have not implemented snapshotting in this environment. '''2: View Log''' This will show you the boot/console log of the instance, so you can see if anything is causing issues. '''3: Hard Reboot Instance''' This will hard reboot your instance, kind of like hitting the power button to power the instance off, then it will power back on moments later. Useful if your instance is hosed because of a software crash or other things that may have crashed the instance. '''4: Delete Instance''' This will permanently destroy your instance. It will be deleted and is unrecoverable. It will also free up the resources it was using such that others can use them however. This is useful if the group quotas have been reached and some old instances need to be cleaned out to make room for new ones. '''5: Start Instance''' This option will be available if the instance is in the "Shut Down" state. It will boot up the instance when this option is invoked. Do not use the other options you may see there, most have not been implemented in our deployment of OpenStack. ==Changing Your OpenStack Web Interface Password== Once you have logged in to the Web Interface, you can change your password by doing the following. On the top right of the OpenStack web interface, you should see a little icon with your username on it. Click that icon to expand the drop down menu there, and select "Settings". Then in the next window, on the left navigation bar, you should see the "Change Password" button. Complete the Change Password dialogue to change your password. You may have to log in again after changing your password. ==Networking== Your instances are connected at 10Gb/s between each other and the internet. 
Of course, actual transfer speeds will likely vary based on disk speed, the speed of the location you are transferring data to or from, and other factors. Your instance will be located in a private network that can only be seen by other instances in your group. Other OpenStack groups are logically separated into their own networks and your instance cannot route to them. Also, no one can access your instance unless they have a VPN account with us, so your instances are completely fenced off from the Greater Internet inbound, which means you are largely secure against script kiddies and hackers. You are able to connect outbound from your instances. ==Etiquette== There is one main thing to remember when using instances in OpenStack. When you create an instance, it uses CPU, RAM and most importantly, it pins disk space for that instance. If you use up all the disk, CPU and RAM quota for your group, then others have no resources left to create their own instances. It is important to know that the best plan of action is to fire up your VM and keep it up when you need it, and then copy your data off it and delete the instance. Document steps taken to create your instance such that you could do it again if you needed to. If the physical node that your instance resides on blows up, then your instance is lost forever and we have no backups, so it is up to you to back up important data. It's also not good form to spin up an instance and store data there, but not log in for months at a time. Then, you are pinning resources that others may need for urgent work. Try to be a good neighbor! 292e94bce6c3fa6906cdf6d9d7a557838f81957c 331 330 2023-06-06T15:09:29Z Anovak 4 /* Launch a New Instance */ wikitext text/x-wiki __TOC__ ==Request an OpenStack Account== Once you have [http://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access PRISM/GI VPN access], you can request an OpenStack account.
You will need to send an email to cluster-admin@soe.ucsc.edu asking for access, and let us know which lab you are in, or who your PI is, so we can place you in the right OpenStack group. ==Create a SSH Public/Private Keypair== To log into an OpenStack VM instance, you will need a SSH public key. The key is "injected" into the instance upon creation, and only someone with that key (i.e. you) will be able to log in via SSH initially. If you already have a SSH public and private key that you use elsewhere, you can use that one, and can skip to the next step. If you don't have a SSH keypair set up yet, then you will need to log into the UNIX compatible machine you will be '''logging in from''' (a Mac/Apple computer will also work), and then run the 'ssh-keygen' command. If you are behind the VPN, you can first log into mustard, crimson or razzmatazz, which are linux servers. The command will look something like this: $ ssh-keygen -t rsa Generating public/private rsa key pair. Enter file in which to save the key (/public/home/frank/.ssh/id_rsa): Created directory '/public/home/frank/.ssh'. Enter passphrase (empty for no passphrase): [JUST HIT ENTER] Enter same passphrase again: [JUST HIT ENTER] Your identification has been saved in /public/home/frank/.ssh/id_rsa. Your public key has been saved in /public/home/frank/.ssh/id_rsa.pub. The key fingerprint is: SHA256:dhJG1A3gcwj7Mz17ommt3NIczMVVgrzp8Tf6F1X4jpI The key's randomart image is: +---[RSA 2048]----+ | ..+o.+ ..o.| | = .. + o..| | . * .. + ..| | o = * o| | So+o + o.| | . =+oE ooo| | +o.....o| | .o++o . .| | .=o. ...| +----[SHA256]-----+ You will then have a new directory, "~/.ssh", and inside that directory you will have a file called "id_rsa.pub". That is your SSH public key. You will need this in the next step in order to set up your key in OpenStack. 
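The interactive '''ssh-keygen''' session above can also be done non-interactively. This sketch generates the key into a scratch directory so nothing existing is overwritten; for real use you would keep the default '''~/.ssh/id_rsa''' location shown above:

```shell
# Generate an RSA keypair with no passphrase, non-interactively.
# (This wiki's OpenStack deployment needs RSA; ED25519 keys do not work with it.)
KEYDIR=$(mktemp -d)
ssh-keygen -q -t rsa -b 2048 -N "" -f "$KEYDIR/id_rsa"

# The .pub file holds the single "ssh-rsa ..." line you will later
# paste into the OpenStack "Public Key" field.
cat "$KEYDIR/id_rsa.pub"
```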
==Log In To giCloud== Once you have been notified that your account has been set up and have been given login credentials, connect to the VPN and then go to this link in your favorite web browser, which is the login page: http://gicloud.prism To login, enter your username and password. Also you will see a "Domain" field, just enter the word "default" for the domain. Click "Log In". You will be logged into your group's summary page. ==Upload your SSH Public Key== After creating your new key in the above "Create a SSH Public/Private Keypair" step, you will need to upload that key into OpenStack. Once you are logged in, on the left hand navigation menu, click "Project", then in the submenu, select "Compute", and finally select "Key Pairs". It should take you to the "Key Pairs" window as shown here. [[File:Keypairs.png|900px]] Next click the "Import Public Key" button on the top right of the window. In the resulting window, name your key in the "Key Pair Name" field. Name it something descriptive like "laptop-key" if the key is on your laptop, or "mustard-key" if you are logged into mustard, etc. '''Your key must be an RSA key!''' The newer ED25519 keys '''do not work''' with our version of OpenStack. To get your key, open a terminal window and type "cat ~/.ssh/id_rsa.pub" to get your full key, as so: $ cat ~/.ssh/id_rsa.pub ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAyVKNfdBbDIk7Iq8JmL+u3vxAn4M1iaQgMU5tHJhMSAYBZEZRLZAFc+Qovxe5zzs1ixte9lCipLep39q2I4U8XND17nYliZ4HVM4MW4GsMUfKsgX2FI3mB2vAQ9pZSLkAhTg2D+92uALUSSv1cDZhTqo7DuPRX2Upxyd5QbRL6TRFswBjHz2vY/JpaPQm1S1d10mokPpmxehLfwp0mVgmz1Uv/6FflqiZ68DhDN67cs1yQgWYXQ01IHPjzTKRwCuZVkgT99rkoqy6TkAyrvsfzYPZbIA2y+ovOBzq6WCUT9gp5Jx/UE6CxLSmAuGPAJkV5D/twKIe75xc+5jdi3I1cgKw== user@laptop Copy that whole line, starting with "ssh-rsa" all the way through the very last character, including the "user@laptop" bit (which may be different for you, just be sure to include it in the line copy). 
Then back in the OpenStack Key Pair dialogue window, paste the public key into the "Public Key" window, then click "Import Key". The key should then appear in the key list. ==Launch a New Instance== We are now ready to launch our new VM instance. On the left navigation menu, select "Project", then in the submenu, select "Compute", and finally select "Instances". You will see any currently running instances in your group in the resulting screen. Next you need to click the "Launch Instance" button on the top right. You will be put into the "Details" tab in the instance creation dialogue. You need to choose an instance name and enter it into the "Instance Name" field. It should include your username as a prefix so that others know who owns each instance. Something like "frank-newtest1" would work well. You can ignore the "Description" field, "Availability Zone" should be "nova" and "Count" should be "1". Next click the "Source" tab on the left. In the "Source" menu, in the "Select Boot Source" field, select "Image" and next to it select "No" for "Create New Volume". Then in the below list of images, choose your image and click the little "Up Arrow" icon to the right of the image you want, to add it. Next click the "Flavor" tab on the left. In that menu, choose how much CPU, RAM and disk space you want for your new VM. Some images have minimum requirements, and as such some of the smaller flavors may not be available. Select your flavor by clicking the little "Up Arrow" icon on the right of your flavor. Next click the "Key Pair" tab on the left. Click the little "Up Arrow" to the right of the Key Pair you created in the previous step, where you created a Key Pair. Ignore the rest of the options on the left, you have configured all you need to launch the instance.
Click the blue "Launch Instance" button on the bottom right of your window, as seen below:

[[File:Launch.png|850px]]

You will be taken back to the Instances Summary page and you should see your new instance launching. After a bit your instance will change from "Spawning" to "Running". This means the instance is now booting, and it should finish booting in a minute or two. In the meantime, we need to attach a "Floating IP" address to your instance so that you can SSH into it. On the right side of your running instance you should see a drop-down menu, usually with the "Create Snapshot" option pre-selected. Click the drop-down menu arrow to open that menu, and select "Associate Floating IP". In the "Associate Floating IP" dialogue, click the drop-down menu to see if any IP addresses are already available, and if so, select one. If none are available, click the little "+" button to the right to allocate a floating IP address. It will ask you what Pool to use; select "ext-net". You can put in a description if you want, but most folks leave that field blank. Then click "Allocate IP". It will take you back one menu level, which has a "Port to be Associated" field; leave that at its default. Click the blue "Associate" button on the bottom right of the window. You will be returned to the "Instances Summary" page. You will see your instance running, and it should now list the "Floating IP" it is running under. That is the IP you will use to SSH to the instance.

==Connect to Your New Instance==

Now that your instance is up and running, let's SSH to it and get going! '''From the computer you created your SSH keys on,''' SSH to your instance using the username matching the OS type you chose (ubuntu, centos, etc.) and the Floating IP address your instance has.
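Once you know the floating IP, an entry in your local ~/.ssh/config saves retyping the address. A sketch only: the "gi-vm" alias is made up, and the address shown is the example IP used on this page; substitute your own values.

```
# ~/.ssh/config -- "gi-vm" and the address below are placeholders
Host gi-vm
    HostName 10.50.100.67
    User ubuntu
    IdentityFile ~/.ssh/id_rsa
```

After that, "ssh gi-vm" (while connected to the VPN) logs you in as the ubuntu user.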
'''You must be connected to the VPN for this to work!'''

Example:

 $ ssh ubuntu@10.50.100.67

If you launched a CentOS instance, it would instead be "ssh centos@10.50.100.67", as appropriate. Assuming everything went as planned, you will be logged into your new Linux instance as the "ubuntu" or "centos" user, which is an unprivileged user. You do, however, have full sudo rights to do whatever administration you need to do. If you get a "Connection Refused" error when trying to SSH in, it means your instance isn't quite through launching yet; try again in about 30 seconds.

At this point it is assumed you have a few systems administration skills under your belt, or at least have some time to ask Google how to perform various Linux tasks as necessary. Your instance has full access to the greater Internet, so you can download things from the Internet, run "apt-get install" or "yum update" or whatever is appropriate, and install any software you need to get your work done.

'''NOTE:''' You are the Systems Administrator of your instance - we cannot support questions on how to administer Linux for you. If OpenStack itself is having issues then please let us know, but please defer questions like "How do I install software on Ubuntu?" to Google searches.

==Storage on Your New Instance==

Most of the storage on your new instance will be located in the /mnt directory, as seen in the output of a "df -h" command on your instance:

 ubuntu@erich1:~$ df -h
 Filesystem      Size  Used Avail Use% Mounted on
 udev             16G     0   16G   0% /dev
 tmpfs           3.2G  676K  3.2G   1% /run
 /dev/vda1        20G  975M   19G   5% /
 tmpfs            16G     0   16G   0% /dev/shm
 tmpfs           5.0M     0  5.0M   0% /run/lock
 tmpfs            16G     0   16G   0% /sys/fs/cgroup
 /dev/vda15      105M  3.4M  102M   4% /boot/efi
 /dev/vdb1       1.0T  1.1G 1023G   1% /mnt
 tmpfs           3.2G     0  3.2G   0% /run/user/1000

Notice that "/mnt" has 1TB of disk space, so store all your big important data in /mnt. Avoid storing data on "/" whenever possible to prevent issues with the root filesystem filling up.
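A script can check the available space on a filesystem before writing large output there. A small sketch (on an instance the target would be /mnt; /tmp stands in here so the sketch runs on any machine):

```shell
# Check the available space on a mount point before writing large files there.
# On an instance the target would be /mnt; /tmp is used here so the sketch
# runs on any machine.
MOUNT=/tmp
AVAIL=$(df -Ph "$MOUNT" | awk 'NR==2 {print $4}')
echo "available on $MOUNT: $AVAIL"
```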
The exact amount of storage available will depend on what flavor you chose when creating the instance.

==Instance Control Options==

Just a few notes on controlling your instances. They are fully functioning Linux machines, so a "sudo reboot" will reboot the machine, "sudo poweroff" will shut it down, etc. In cloud parlance, "Shut Down" means the instance is still there but the power is off. "Terminated" means it is fully deleted and unrecoverable, so be sure you really want to delete your instance before you do so. We do not back instances up. We also have no access to your instance, so we cannot log in and see what's going on.

You can control your instance in several ways from the OpenStack web interface, on the Instance Summary page. On the right side of your instance in the list is that little drop-down menu. Options of interest are:

'''1: Create Snapshot''' Never use this option, as we have not implemented snapshotting in this environment.

'''2: View Log''' This will show you the boot/console log of the instance, so you can see if anything is causing issues.

'''3: Hard Reboot Instance''' This will hard reboot your instance, kind of like hitting the power button to power the instance off; it will power back on moments later. Useful if your instance is hosed because of a software crash or the like.

'''4: Delete Instance''' This will permanently destroy your instance. It will be deleted and is unrecoverable. It will also free up the resources it was using so that others can use them. This is useful if the group quotas have been reached and some old instances need to be cleaned out to make room for new ones.

'''5: Start Instance''' This option is available if the instance is in the "Shut Down" state; invoking it will boot the instance back up.

Do not use the other options you may see there; most have not been implemented in our deployment of OpenStack.
==Changing Your OpenStack Web Interface Password==

Once you have logged in to the web interface, you can change your password by doing the following. On the top right of the OpenStack web interface, you should see a little icon with your username on it. Click that icon to expand the drop-down menu, and select "Settings". Then, in the next window, on the left navigation bar, you should see the "Change Password" button. Complete the Change Password dialogue to change your password. You may have to log in again after changing your password.

==Networking==

Your instances are connected at 10Gb/s to each other and the Internet. Of course, actual transfer speeds will likely vary based on disk speed, the speed of the system you are transferring data to or from, and other factors. Your instance is located in a private network that can only be seen by other instances in your group. Other OpenStack groups are logically separated into their own networks, and your instance cannot route to them. Also, no one can access your instance unless they have a VPN account with us, so your instances are completely fenced off from inbound connections from the greater Internet, which means you are largely secure against script kiddies and hackers. You are still able to connect outbound from your instances.

==Etiquette==

There is one main thing to remember when using instances in OpenStack. When you create an instance, it uses CPU and RAM and, most importantly, it pins disk space for that instance. If you use up all the disk, CPU and RAM quota for your group, others have no resources left to create their own instances. The best plan of action is to fire up your VM, keep it up while you need it, then copy your data off and delete the instance. Document the steps taken to create your instance so that you could do it again if needed.
If the physical node that your instance resides on blows up, then your instance is lost forever and we have no backups, so it is up to you to back up important data. It's also not good form to spin up an instance and store data there but not log in for months at a time; then you are pinning resources that others may need for urgent work. Try to be a good neighbor! 2479bff28d9ffcd56d1c9d26d22a0966edf5f15e Access to the Firewalled Compute Servers 0 17 332 242 2023-06-12T21:52:11Z Weiler 3 /* Server Types and Management */ wikitext text/x-wiki Before you can access the firewalled environment (Prism), you must get VPN access to it, which is detailed here: [[Requirement for users to get GI VPN access]]

== Account and Storage Cost ==

Costs for having an active UNIX account and for storage (per TB) are listed in this document, under "Genomics Project Support", specifically "Genomics IT Systems User Support per user" and "Genomics Data Storage per TB": https://planning.ucsc.edu/budget/rates-and-assessments/recharge-rates/docs/2021-22-approved-recharge-rates.pdf

== Account Expiration ==

Your UNIX account will have an expiration date associated with it after creation. If your account was created in January, February or March, then your account expiration date will be July 1st of the '''current year'''. If the account was created after March, then your expiration date will be July 1st of the '''following year'''. You will receive notice by email when your account is about to expire. To renew, simply ask the PI that sponsored you (who will be named in the notice) to email '''cluster-admin@soe.ucsc.edu''' requesting that your account be renewed for another year. If your account expires, the account will be suspended and you will no longer be able to log in or view any data you may have in our systems. Any automated scripts (owned by you) that run via cron or other mechanisms will cease to function.
== Server Types and Management ==

After confirming your VPN software is working, you can ssh into one of the compute servers behind the VPN:

'''crimson.prism''': 256GB RAM, 32 cores, 5.5TB local scratch space, CentOS 7.9

'''razzmatazz.prism''': 256GB RAM, 32 cores, 5.5TB local scratch space, Ubuntu 22.04

'''mustard.prism''': 1.5TB RAM, 160 cores, 9TB local scratch space, Ubuntu 22.04

These servers are managed by the Genomics Institute Cluster Admin group. If you need software installed on any of these servers, please make your request by emailing cluster-admin@soe.ucsc.edu.

== Storage ==

These servers mount two types of storage: home directories and group storage directories. Your home directory will be located at "/private/home/username" and has a 30GB quota. The group storage directories are created per PI, and each group directory has a 15TB quota. For example, if David Haussler is the PI that you report to directly, then the directory would exist as /private/groups/hausslerlab. Request access to that group directory and you will then be able to write to it. Each of those group directories is shared by the lab it belongs to, so you must be mindful of everyone's data usage and share the 15TB available per group accordingly.

On the compute servers you can check your group's current quota usage by using the '/usr/bin/viewquota' command. You can only check the quota of a group you are part of (you would be a member of the UNIX group of the same name). If you wanted to check the quota usage of /private/groups/hausslerlab, for example, you would do:

 $ viewquota hausslerlab
 Project quota on /export (/dev/mapper/export)
 Project ID   Used   Soft   Hard   Warn/Grace
 ----------   ---------------------------------
 hausslerlab  1.8T   15T    16T    00 [------]

== Actually Doing Work and Computing ==

When doing research, running jobs and the like, please be careful of your resource consumption on the server you are on.
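A quick look before starting anything heavy can be sketched from the shell (all standard Linux tools; "my_job.sh" is a placeholder, not a real script):

```shell
# See how many cores this server has and what the current load is.
nproc
uptime
# List the busiest processes already running (batch mode, one pass).
top -b -n 1 | head -n 15
# Start your own work at low CPU priority so interactive users are not
# starved; "my_job.sh" is a placeholder for your actual command.
nice -n 19 sh -c 'echo "my_job.sh would run here"'
```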
Don't run so many threads or processes at once that they overrun the available RAM or disk IO. If you are not sure of your potential RAM, CPU or disk impact, start small with one or two processes and work your way up from there. Also, before running your stuff, check what else is already happening on the server by using the 'top' command to see who and what else is running and what kind of resources are already being consumed. If, after starting a process, you realize that the server slows down considerably or becomes unusable, kill your processes and re-evaluate what you need to make things work. These servers are shared resources - be a good neighbor!

== The Firewall ==

All servers are behind a firewall in this environment, and as such, you must connect to the VPN in order to access them. They will not be accessible from the greater Internet without VPN. You will, however, be able to connect outbound from them to other servers on the Internet to copy data in, sync git repos, and the like; it is only inbound connections that are blocked. All machines behind the firewall have the private domain name suffix of "*.prism".

== /scratch Space on the Servers ==

Each server will generally have a local /scratch filesystem that you can use to store temporary files. '''BE ADVISED''' that /scratch is not backed up, and the data there could disappear in the event of a disk failure or anything else. Do not store important data there. If it is important, it should be moved somewhere else very soon after creation.
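A common pattern for using /scratch safely is a throwaway per-user working directory that you clean up when done. A sketch ("/tmp" stands in for /scratch here so it runs anywhere):

```shell
# Keep temporary job output in a per-user directory under the local scratch
# filesystem. "/tmp" stands in for /scratch here so the sketch runs anywhere.
SCRATCH=${SCRATCH:-/tmp}
WORKDIR=$(mktemp -d "$SCRATCH/${USER:-user}.XXXXXX")
echo "writing temporary output under $WORKDIR"
# ... run your job here, writing into $WORKDIR ...
# Copy anything worth keeping to group storage promptly, then clean up.
rm -rf "$WORKDIR"
```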
98cbf2ca3d135338aeb24d3e57d9b1ef2694536a 333 332 2023-06-12T21:52:43Z Weiler 3 /* Server Types and Management */ wikitext text/x-wiki Before you can access the firewalled environment (Prism), you must get VPN access to it, which is detailed here: [[Requirement for users to get GI VPN access]] == Account and Storage Cost == Costs for having an active UNIX account and for storage (per TB) are listed in this document, under "Genomics Project Support", specifically "Genomics IT Systems User Support per user" and "Genomics Data Storage per TB": https://planning.ucsc.edu/budget/rates-and-assessments/recharge-rates/docs/2021-22-approved-recharge-rates.pdf == Account Expiration == Your UNIX account will have an expiration date associated with it after creation. If your account was created in January, February or March, then your account expiration date will be July 1st of the '''current year'''. If the account was created after March, then your expiration date will be July 1st of the '''following year'''. You will receive notice by email when your account is about to expire. To renew, simply ask the PI that sponsored you (which will be included in the notice) to email '''cluster-admin@soe.ucsc.edu''' requesting that your account to be renewed for another year. If your account expires, the account will be suspended and you will no longer be able to login or view any data you may have in our systems. Any automated scripts (owned by you) that run via cron or other mechanisms will cease to function. == Server Types and Management== After confirming your VPN software is working, you can ssh into one of the compute servers behind the VPN: '''crimson.prism''': 256GB RAM, 32 cores, 5.5TB local scratch space, CentOS 7.9 '''razzmatazz.prism''': 256GB RAM, 32 cores, 5.5TB local scratch space, Ubuntu 22.04 '''mustard.prism''': 1.5TB RAM, 160 cores, 9TB local scratch space, Ubuntu 22.04 These servers are managed by the Genomics Institute Cluster Admin group. 
If you need software installed on any of these servers, please make your request by emailing cluster-admin@soe.ucsc.edu. == Storage == These servers mount two types of storage; home directories and group storage directories. Your home directory will be located as "/private/home/username" and has a 30GB quota. The group storage directories are created per PI, and each group directory has a 15TB quota. For example, if David Haussler is the PI that you report to directly, then the directory would exist as /private/groups/hausslerlab. Request access to that group directory and you will then be able to write to it. Each of those group directories are shared by the lab it belongs to, so you must be wary of everyone's data usage and share the 15TB available per group accordingly. On the compute servers you can check your group's current quota usage by using the '/usr/bin/viewquota' command. You can only check the quota of a group you are part of (you would be a member of the UNIX group of the same name). If you wanted to check the quota usage of /private/groups/hausslerlab for example, you would do: $ viewquota hausslerlab Project quota on /export (/dev/mapper/export) Project ID Used Soft Hard Warn/Grace ---------- --------------------------------- hausslerlab 1.8T 15T 16T 00 [------] == Actually Doing Work and Computing == When doing research, running jobs and the like, please be careful of your resource consumption on the server you are on. Don't run too many threads or cores at once if such a thing overruns the RAM available or the disk IO available. If you are not sure of your potential RAM, CPU or disk impact, start small with one or two processes and work your way up from there. Also, before running your stuff, check what else is already happening on the server by using the 'top' command to see who else and what else is running and what kind of resources are already being consumed. 
If, after starting a process, you realize that the server slows down considerably or becomes unusable, kill your processes and re-evaluate what you need to make things work. These servers are shared resources - be a good neighbor! == The Firewall == All servers are behind a firewall in this environment, and as such, you must connect to the VPN in order to access them. They will not be accessible from the greater Internet without VPN. Although you will be able to connect outbound from them to other servers on the internet to copy data in, sync git repos, stuff like that. It is only inbound connections that will be blocked. All machines behind the firewall have the private domain name suffix of "*.prism". == /scratch Space on the Servers == Each server will generally have a local /scratch filesystem that you can use to store temporary files. '''BE ADVISED''' that /scratch is not backed up, and the data there could disappear in the event of a disk failure or anything else. Do not store important data there. If it is important, it should be moved somewhere else very soon after creation. d4a1b348245d89d389b026b695a2685b0751bdbd 334 333 2023-06-12T21:58:48Z Weiler 3 wikitext text/x-wiki Before you can access the firewalled environment (Prism), you must get VPN access to it, which is detailed here: [[Requirement for users to get GI VPN access]] == Account and Storage Cost == Costs for having an active UNIX account and for storage (per TB) are listed in this document, under "Genomics Project Support", specifically "Genomics IT Systems User Support per user" and "Genomics Data Storage per TB": https://planning.ucsc.edu/budget/rates-and-assessments/recharge-rates/docs/2021-22-approved-recharge-rates.pdf == Account Expiration == Your UNIX account will have an expiration date associated with it after creation. If your account was created in January, February or March, then your account expiration date will be July 1st of the '''current year'''. 
If the account was created after March, then your expiration date will be July 1st of the '''following year'''. You will receive notice by email when your account is about to expire. To renew, simply ask the PI that sponsored you (which will be included in the notice) to email '''cluster-admin@soe.ucsc.edu''' requesting that your account to be renewed for another year. If your account expires, the account will be suspended and you will no longer be able to login or view any data you may have in our systems. Any automated scripts (owned by you) that run via cron or other mechanisms will cease to function. == Server Types and Management== After confirming your VPN software is working, you can ssh into one of the compute servers behind the VPN: '''crimson.prism''': 256GB RAM, 32 cores, 5.5TB local scratch space, CentOS 7.9 '''razzmatazz.prism''': 256GB RAM, 32 cores, 5.5TB local scratch space, Ubuntu 22.04 '''mustard.prism''': 1.5TB RAM, 160 cores, 9TB local scratch space, Ubuntu 22.04 These servers are managed by the Genomics Institute Cluster Admin group. If you need software installed on any of these servers, please make your request by emailing cluster-admin@soe.ucsc.edu. == Storage == These servers mount two types of storage; home directories and group storage directories. Your home directory will be located as "/private/home/username" and has a 30GB quota. The group storage directories are created per PI, and each group directory has a 15TB quota. For example, if David Haussler is the PI that you report to directly, then the directory would exist as /private/groups/hausslerlab. Request access to that group directory and you will then be able to write to it. Each of those group directories are shared by the lab it belongs to, so you must be wary of everyone's data usage and share the 15TB available per group accordingly. On the compute servers you can check your group's current quota usage by using the '/usr/bin/viewquota' command. 
You can only check the quota of a group you are part of (you would be a member of the UNIX group of the same name). If you wanted to check the quota usage of /private/groups/hausslerlab for example, you would do: $ viewquota hausslerlab Project quota on /export (/dev/mapper/export) Project ID Used Soft Hard Warn/Grace ---------- --------------------------------- hausslerlab 1.8T 15T 16T 00 [------] == Actually Doing Work and Computing == When doing research, running jobs and the like, please be careful of your resource consumption on the server you are on. Don't run too many threads or cores at once if such a thing overruns the RAM available or the disk IO available. If you are not sure of your potential RAM, CPU or disk impact, start small with one or two processes and work your way up from there. Also, before running your stuff, check what else is already happening on the server by using the 'top' command to see who else and what else is running and what kind of resources are already being consumed. If, after starting a process, you realize that the server slows down considerably or becomes unusable, kill your processes and re-evaluate what you need to make things work. These servers are shared resources - be a good neighbor! == The Firewall == All servers are behind a firewall in this environment, and as such, you must connect to the VPN in order to access them. They will not be accessible from the greater Internet without VPN. Although you will be able to connect outbound from them to other servers on the internet to copy data in, sync git repos, stuff like that. It is only inbound connections that will be blocked. All machines behind the firewall have the private domain name suffix of "*.prism". == /scratch Space on the Servers == Each server will generally have a local /scratch filesystem that you can use to store temporary files. 
'''BE ADVISED''' that /scratch is not backed up, and the data there could disappear in the event of a disk failure or anything else. Do not store important data there. If it is important, it should be moved somewhere else very soon after creation. == The Phoenix Cluster == This is a cluster of ~20 nodes, some of which have GPUs in them. Each node generally has about 2TB RAM and 128 cores, although the cluster is heterogeneous and has multiple node types. The cluster head node, from which all jobs are submitted via the SLURM job scheduling framework, is """phoenix.prism""". To learn more about how to use Slurm, refer to: c0d69ad79b3cf2d8b1278b358442794c89612355 335 334 2023-06-12T22:00:29Z Weiler 3 wikitext text/x-wiki Before you can access the firewalled environment (Prism), you must get VPN access to it, which is detailed here: [[Requirement for users to get GI VPN access]] == Account and Storage Cost == Costs for having an active UNIX account and for storage (per TB) are listed in this document, under "Genomics Project Support", specifically "Genomics IT Systems User Support per user" and "Genomics Data Storage per TB": https://planning.ucsc.edu/budget/rates-and-assessments/recharge-rates/docs/2021-22-approved-recharge-rates.pdf == Account Expiration == Your UNIX account will have an expiration date associated with it after creation. If your account was created in January, February or March, then your account expiration date will be July 1st of the '''current year'''. If the account was created after March, then your expiration date will be July 1st of the '''following year'''. You will receive notice by email when your account is about to expire. To renew, simply ask the PI that sponsored you (which will be included in the notice) to email '''cluster-admin@soe.ucsc.edu''' requesting that your account to be renewed for another year. 
If your account expires, the account will be suspended and you will no longer be able to login or view any data you may have in our systems. Any automated scripts (owned by you) that run via cron or other mechanisms will cease to function. == Server Types and Management== After confirming your VPN software is working, you can ssh into one of the compute servers behind the VPN: '''crimson.prism''': 256GB RAM, 32 cores, 5.5TB local scratch space, CentOS 7.9 '''razzmatazz.prism''': 256GB RAM, 32 cores, 5.5TB local scratch space, Ubuntu 22.04 '''mustard.prism''': 1.5TB RAM, 160 cores, 9TB local scratch space, Ubuntu 22.04 These servers are managed by the Genomics Institute Cluster Admin group. If you need software installed on any of these servers, please make your request by emailing cluster-admin@soe.ucsc.edu. == Storage == These servers mount two types of storage; home directories and group storage directories. Your home directory will be located as "/private/home/username" and has a 30GB quota. The group storage directories are created per PI, and each group directory has a 15TB quota. For example, if David Haussler is the PI that you report to directly, then the directory would exist as /private/groups/hausslerlab. Request access to that group directory and you will then be able to write to it. Each of those group directories are shared by the lab it belongs to, so you must be wary of everyone's data usage and share the 15TB available per group accordingly. On the compute servers you can check your group's current quota usage by using the '/usr/bin/viewquota' command. You can only check the quota of a group you are part of (you would be a member of the UNIX group of the same name). 
If you wanted to check the quota usage of /private/groups/hausslerlab for example, you would do: $ viewquota hausslerlab Project quota on /export (/dev/mapper/export) Project ID Used Soft Hard Warn/Grace ---------- --------------------------------- hausslerlab 1.8T 15T 16T 00 [------] == Actually Doing Work and Computing == When doing research, running jobs and the like, please be careful of your resource consumption on the server you are on. Don't run too many threads or cores at once if such a thing overruns the RAM available or the disk IO available. If you are not sure of your potential RAM, CPU or disk impact, start small with one or two processes and work your way up from there. Also, before running your stuff, check what else is already happening on the server by using the 'top' command to see who else and what else is running and what kind of resources are already being consumed. If, after starting a process, you realize that the server slows down considerably or becomes unusable, kill your processes and re-evaluate what you need to make things work. These servers are shared resources - be a good neighbor! == The Firewall == All servers are behind a firewall in this environment, and as such, you must connect to the VPN in order to access them. They will not be accessible from the greater Internet without VPN. Although you will be able to connect outbound from them to other servers on the internet to copy data in, sync git repos, stuff like that. It is only inbound connections that will be blocked. All machines behind the firewall have the private domain name suffix of "*.prism". == /scratch Space on the Servers == Each server will generally have a local /scratch filesystem that you can use to store temporary files. '''BE ADVISED''' that /scratch is not backed up, and the data there could disappear in the event of a disk failure or anything else. Do not store important data there. 
If it is important, it should be moved somewhere else very soon after creation. == The Phoenix Cluster == This is a cluster of ~20 Ubuntu 22.04 nodes, some of which have GPUs in them. Each node generally has about 2TB RAM and 128 cores, although the cluster is heterogeneous and has multiple node types. The cluster head node, from which all jobs are submitted via the SLURM job scheduling framework, is """phoenix.prism""". To learn more about how to use Slurm, refer to: https://giwiki.gi.ucsc.edu/index.php/Genomics_Institute_Computing_Information#Slurm_at_the_Genomics_Institute ffc8abc1d25c5f2acf930a91b4c7bf868729bf87 336 335 2023-06-12T22:01:15Z Weiler 3 /* The Phoenix Cluster */ wikitext text/x-wiki Before you can access the firewalled environment (Prism), you must get VPN access to it, which is detailed here: [[Requirement for users to get GI VPN access]] == Account and Storage Cost == Costs for having an active UNIX account and for storage (per TB) are listed in this document, under "Genomics Project Support", specifically "Genomics IT Systems User Support per user" and "Genomics Data Storage per TB": https://planning.ucsc.edu/budget/rates-and-assessments/recharge-rates/docs/2021-22-approved-recharge-rates.pdf == Account Expiration == Your UNIX account will have an expiration date associated with it after creation. If your account was created in January, February or March, then your account expiration date will be July 1st of the '''current year'''. If the account was created after March, then your expiration date will be July 1st of the '''following year'''. You will receive notice by email when your account is about to expire. To renew, simply ask the PI that sponsored you (which will be included in the notice) to email '''cluster-admin@soe.ucsc.edu''' requesting that your account to be renewed for another year. If your account expires, the account will be suspended and you will no longer be able to login or view any data you may have in our systems. 
Before you can access the firewalled environment (Prism), you must get VPN access to it, which is detailed here: [[Requirement for users to get GI VPN access]]

== Account and Storage Cost ==

Costs for having an active UNIX account and for storage (per TB) are listed in this document under "Genomics Project Support", specifically "Genomics IT Systems User Support per user" and "Genomics Data Storage per TB": https://planning.ucsc.edu/budget/rates-and-assessments/recharge-rates/docs/2021-22-approved-recharge-rates.pdf

== Account Expiration ==

Your UNIX account is assigned an expiration date when it is created. If your account was created in January, February, or March, it will expire on July 1st of the '''current year'''. If it was created after March, it will expire on July 1st of the '''following year'''.

You will receive notice by email when your account is about to expire. To renew, ask the PI who sponsored you (named in the notice) to email '''cluster-admin@soe.ucsc.edu''' requesting that your account be renewed for another year. If your account expires, it will be suspended and you will no longer be able to log in or view any data you have on our systems.
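Suspension also stops anything your account has scheduled. A quick, hedged way to see what you currently have under cron before the deadline (nothing here is site-specific):

```shell
# List any cron entries owned by your account, so you know what will
# stop running if the account is suspended. `crontab -l` exits
# non-zero when there is no crontab, so handle that case explicitly.
if crontab -l 2>/dev/null; then
    echo "Review the entries above before your expiration date."
else
    echo "No user crontab found."
fi
```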
Any automated scripts (owned by you) that run via cron or other mechanisms will also cease to function.
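Once your VPN connection is up, the compute servers are reachable by ordinary ssh. As a convenience you can add shortcut entries to ~/.ssh/config; the snippet below writes an example entry to a temp file so it is safe to run as-is (the hostname comes from this page, "your_username" is a placeholder):

```shell
# Example ~/.ssh/config entry for a .prism host. Written to a temp
# file here for illustration; append to your real ~/.ssh/config if
# you want the shortcut. "your_username" is a placeholder.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
Host mustard
    HostName mustard.prism
    User your_username
EOF
cat "$cfg"
```

With such an entry in your real ~/.ssh/config, `ssh mustard` connects to mustard.prism (VPN required).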
== Server Types and Management ==

After confirming your VPN software is working, you can ssh into one of the shared compute servers behind the VPN:

{| class="wikitable" style="text-align:center;"
|- style="font-weight:bold; text-align:left;"
! Node Name
! Operating System
! CPU Cores
! Memory
! Network Bandwidth
! Scratch Space
|-
| style="text-align:left;" | mustard
| style="text-align:left;" | Ubuntu 22.04
| 160
| 1.5 TB
| 10 Gb/s
| 9 TB
|-
| style="text-align:left;" | crimson
| style="text-align:left;" | CentOS 7.5
| 32
| 256 GB
| 10 Gb/s
| 5.5 TB
|-
| style="text-align:left;" | razzmatazz
| style="text-align:left;" | Ubuntu 22.04
| 32
| 256 GB
| 10 Gb/s
| 5.5 TB
|}

These ''shared'' servers are managed by the Genomics Institute Cluster Admin group. If you need software installed on any of these servers, please make your request by emailing cluster-admin@soe.ucsc.edu.
== The Phoenix Cluster ==

This is a cluster of ~20 Ubuntu 22.04 nodes, some of which have GPUs. Each node generally has about 2 TB RAM and 128 cores, although the cluster is heterogeneous and has multiple node types.

{| class="wikitable" style="text-align:center;"
|- style="font-weight:bold;"
! Node Name
! Operating System
! CPU Cores
! GPUs/Type
! Memory
! Network Bandwidth
! Scratch Space
|-
| style="text-align:left;" | phoenix-00
| style="text-align:left;" | Ubuntu 22.04
| 256
| 8 / nVidia A100
| 1 TB
| 10 Gb/s
| 16 TB NVMe
|-
| style="text-align:left;" | phoenix-[01-05]
| style="text-align:left;" | Ubuntu 22.04
| 128
| 8 / nVidia A5500
| 2 TB
| 10 Gb/s
| 16 TB NVMe
|-
| style="text-align:left;" | phoenix-[06-08]
| style="text-align:left;" | Ubuntu 22.04
| 128
| N/A
| 2 TB
| 10 Gb/s
| 16 TB NVMe
|-
| style="text-align:left;" | phoenix-09
| style="text-align:left;" | Ubuntu 22.04
| 40
| 8 / nVidia 1080ti
| 256 GB
| 10 Gb/s
| 1.5 TB NVMe
|-
| style="text-align:left;" | phoenix-10
| style="text-align:left;" | Ubuntu 22.04
| 36
| 4 / nVidia 2080ti
| 385 GB
| 10 Gb/s
| 3 TB NVMe
|-
| style="text-align:left;" | phoenix-[11-20]
| style="text-align:left;" | Ubuntu 22.04
| 128
| N/A
| 2 TB
| 10 Gb/s
| 16 TB NVMe
|}

The cluster head node, from which all jobs are submitted via the Slurm job scheduling framework, is '''phoenix.prism'''.
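Jobs are submitted from phoenix.prism with sbatch. Below is a minimal sketch of a batch script; the job name and resource numbers are placeholders, and any partition or account options your lab uses are not shown, so check the Slurm pages on this wiki before relying on it.

```shell
#!/bin/bash
#SBATCH --job-name=demo          # placeholder name
#SBATCH --cpus-per-task=4        # placeholder resource requests
#SBATCH --mem=16G
#SBATCH --time=01:00:00

# On the compute nodes TMPDIR points at the node-local /data/tmp scratch,
# which is cleaned often -- write temporary files there, and copy results
# you want to keep back to /private/groups before the job ends.
cd "$TMPDIR"
echo "Running on $(hostname) with ${SLURM_CPUS_PER_TASK} CPUs"
```

Submit with `sbatch demo.sh` and watch the queue with `squeue -u $USER`.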
To learn more about how to use Slurm, refer to: https://giwiki.gi.ucsc.edu/index.php/Genomics_Institute_Computing_Information#Slurm_at_the_Genomics_Institute

For scratch on the cluster, TMPDIR will be set to /data/tmp. That area is cleaned often, so don't store any data there that isn't being used by your jobs.

Welcome to the Genomic Institute Computing Information Repository!
Browse the topics below for help in the area you are curious about.
== GI Public Computing Environment ==
*[[How to access the public servers]]

== GI Firewalled Computing Environment (PRISM) ==
*[[Access to the Firewalled Compute Servers]]
*[[Firewalled Computing Resources Overview]]
*[[Firewalled Environment Storage Overview]]
*[[Firewalled User Account and Storage Cost]]

== VPN Access ==
*[[Requirement for users to get GI VPN access]]

== NIH dbGaP Access Requirements ==
*[[Requirements for dbGaP Access]]

== giCloud Openstack ==
*[[Overview of giCloud in the Genomics Institute]]
*[[Quick Start Instructions to Get Rolling with OpenStack]]

== Amazon Web Services Information ==
*[[Overview of Getting and Using an AWS IAM Account]]
*[[AWS Account List and Numbers]]
*[[AWS Shared Bucket Usage Graphs]]
*[[AWS Best Practices]]
*[[AWS S3 Lifecycle Management]]

== Slurm at the Genomics Institute ==
*[[Overview of using Slurm]]
*[[Annotated Slurm Script]]
*[[Job Arrays]]
*[[GPU Resources]]
*[[Quick Reference Guide]]
*[[Slurm Tips for vg]]
*[[Slurm Tips for Toil]]
*[[Using Docker under Slurm]]
*[[Phoenix WDL Tutorial]]

== Kubernetes Information ==
*[[Computational Genomics Kubernetes Installation]]
*[[Undiagnosed Disease Project Kubernetes Installation]]

== Problems or technical support ==
If you have any problems with the GI computing environment, please send an email to '''cluster-admin@soe.ucsc.edu'''.
If you are not sure of your potential RAM, CPU, or disk impact, start small with one or two processes and work your way up from there. Before starting work, check what is already happening on the server with the 'top' command to see who and what is running and what resources are already in use. If, after starting a process, you find that the server has slowed considerably or become unusable, kill your processes and re-evaluate what you need to make things work. These servers are shared resources - be a good neighbor!

== Server Types and Management ==
After confirming your VPN software is working, you can ssh into one of the shared compute servers behind the VPN. The DNS suffix for all machines is ".prism", so "mustard" has the full DNS name "mustard.prism":

{| class="wikitable" style="text-align:center;"
|- style="font-weight:bold; text-align:left;"
! Node Name
! Operating System
! CPU Cores
! Memory
! Network Bandwidth
! Scratch Space
|-
| style="text-align:left;" | mustard
| style="text-align:left;" | Ubuntu 22.04
| 160
| 1.5 TB
| 10 Gb/s
| 9 TB
|-
| style="text-align:left;" | emerald
| style="text-align:left;" | Ubuntu 22.04
| 64
| 1 TB
| 10 Gb/s
| 690 GB
|-
| style="text-align:left;" | crimson
| style="text-align:left;" | Ubuntu 22.04
| 32
| 256 GB
| 10 Gb/s
| 5.5 TB
|-
| style="text-align:left;" | razzmatazz
| style="text-align:left;" | Ubuntu 22.04
| 32
| 256 GB
| 10 Gb/s
| 5.5 TB
|}

These ''shared'' servers are managed by the Genomics Institute Cluster Admin group. If you need software installed on any of these servers, please make your request by emailing cluster-admin@soe.ucsc.edu.

== The Firewall ==
All servers in this environment are behind a firewall, so you must connect to the VPN in order to access them; they are not reachable from the greater Internet without it. You can still make outbound connections from them to other servers on the Internet to copy data in, sync git repos, and the like - only inbound connections are blocked. All machines behind the firewall have the private domain name suffix "*.prism".
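The load-checking advice above can be condensed into a quick pre-flight check. This sketch uses only standard Linux tools that ship with the Ubuntu servers listed; nothing here is site-specific:

```shell
# Quick pre-flight check before starting heavy work on a shared server.
nproc                      # how many CPU cores this host has
uptime                     # load averages; compare these to the core count
free -h                    # memory in use vs. available
top -b -n 1 | head -n 12   # one batch-mode snapshot of the busiest processes
```

If the 1-minute load average is already near the core count, or free memory is low, pick another server or scale your job down.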
== The Phoenix Cluster ==
This is a cluster of roughly 20 Ubuntu 22.04 nodes, some of which have GPUs. Each node generally has about 2 TB of RAM and 128 cores, although the cluster is heterogeneous and has multiple node types.

{| class="wikitable" style="text-align:center;"
|- style="font-weight:bold;"
! Node Name
! Operating System
! CPU Cores
! GPUs/Type
! Memory
! Network Bandwidth
! Scratch Space
|-
| style="text-align:left;" | phoenix-00
| style="text-align:left;" | Ubuntu 22.04
| 256
| 8 / Nvidia A100
| 1 TB
| 10 Gb/s
| 16 TB NVMe
|-
| style="text-align:left;" | phoenix-[01-05]
| style="text-align:left;" | Ubuntu 22.04
| 128
| 8 / Nvidia A5500
| 2 TB
| 10 Gb/s
| 16 TB NVMe
|-
| style="text-align:left;" | phoenix-[06-08]
| style="text-align:left;" | Ubuntu 22.04
| 128
| N/A
| 2 TB
| 10 Gb/s
| 16 TB NVMe
|-
| style="text-align:left;" | phoenix-09
| style="text-align:left;" | Ubuntu 22.04
| 40
| 8 / Nvidia 1080ti
| 256 GB
| 10 Gb/s
| 1.5 TB NVMe
|-
| style="text-align:left;" | phoenix-10
| style="text-align:left;" | Ubuntu 22.04
| 36
| 4 / Nvidia 2080ti
| 385 GB
| 10 Gb/s
| 3 TB NVMe
|-
| style="text-align:left;" | phoenix-[11-20]
| style="text-align:left;" | Ubuntu 22.04
| 128
| N/A
| 2 TB
| 10 Gb/s
| 16 TB NVMe
|}

The cluster head node, from which all jobs are submitted via the Slurm job scheduling framework, is '''phoenix.prism'''. To learn more about how to use Slurm, refer to: https://giwiki.gi.ucsc.edu/index.php/Genomics_Institute_Computing_Information#Slurm_at_the_Genomics_Institute

For scratch space on the cluster, TMPDIR is set to /data/tmp. That area is cleaned often, so don't store any data there that isn't being used by your jobs.

= Firewalled Environment Storage Overview =

== Storage ==
Our servers mount two types of ''shared'' storage: home directories and group storage directories. These directories mount over the network on all shared compute servers and the Phoenix cluster, so any server you log in to will have these filesystems available.

'''Filesystem Specifications'''

{| class="wikitable" style="text-align:center;"
|- style="font-weight:bold; text-align:left;"
! Filesystem
! /private/home
! /private/groups
|-
| style="font-weight:bold; text-align:left;" | Default Soft Quota
| 30 GB
| 15 TB
|-
| style="font-weight:bold; text-align:left;" | Default Hard Quota
| 31 GB
| 16 TB
|-
| style="font-weight:bold; text-align:left;" | Total Capacity
| 19 TB
| 500 TB
|- style="text-align:left;"
| style="font-weight:bold;" | Access Speed
| Moderate (Spinning Disk)
| Very Fast (NVMe Flash Media)
|- style="text-align:left;"
| style="font-weight:bold;" | Intended Use
| This space should be used for login scripts, small bits of code or software repos, etc. No large data should be stored here.
| This space should be used for large computational/shared data, large software installations, and the like.
|}

'''Home Directories (/private/home/username)'''

Your home directory is located at /private/home/username and has a 30 GB soft quota and a 31 GB hard quota. It is meant for small scripts, login data, or a git repo. Please do not store large data there or run large jobs against data in your home directory.
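To see how close you are to the home-directory soft quota described above, a simple check with standard tools (no site-specific commands) is:

```shell
# Show total home-directory usage, then the largest top-level items
# (candidates to move out to group storage if you are near the limit).
du -sh "$HOME"
du -h --max-depth=1 "$HOME" 2>/dev/null | sort -rh | head -n 10
```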
'''Groups Directories (/private/groups/groupname)'''

The group storage directories are created per PI, and each has a default 15 TB soft quota and 16 TB hard quota. For example, if David Haussler is the PI you report to directly, the directory exists as /private/groups/hausslerlab. Request access to that group directory and you will then be able to write to it. Each group directory is shared by the lab it belongs to, so be mindful of everyone's data usage and share the 15 TB available per group accordingly.

On the compute servers you can check your group's current quota usage with the '/usr/bin/viewquota' command. You can only check the quota of a group you are part of (you would be a member of the UNIX group of the same name). To check the quota usage of /private/groups/hausslerlab, for example, you would run:

 $ viewquota hausslerlab
 Project quota on /export (/dev/mapper/export)
 Project ID   Used   Soft   Hard   Warn/Grace
 ----------   ----   ----   ----   ----------
 hausslerlab  1.8T    15T    16T   00 [------]

'''Soft Versus Hard Quotas'''

We use soft and hard quotas for disk space. Once you exceed a directory's soft quota, a one-week countdown timer starts. When that timer runs out, you will no longer be able to create new files or write more data in that directory. You can reset the countdown timer by dropping back under the soft quota limit.

You will not be permitted to exceed a directory's hard quota at all. Any attempt to do so will produce an error; the precise error depends on how your software responds to running out of disk space.

When quotas are first applied to a directory, or are reduced, it is possible to end up with more data or files in the directory than the quota allows. This does not trigger deletion of any existing data, but it will prevent creation of new data or files.
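A small sketch of how you might script a soft-quota warning from this kind of output. The column layout is assumed to match the example above (viewquota is site-specific, so adjust the field handling if your output differs); here the sample output is fed in directly, whereas on a real server you would pipe `viewquota <group>` into the function:

```shell
# parse_quota: read viewquota-style output and report whether each
# project is over its soft quota. Assumes the 4-line layout shown above.
parse_quota() {
  awk 'NR > 3 && NF >= 3 {
    used = $2; soft = $3
    gsub(/[A-Za-z]/, "", used); gsub(/[A-Za-z]/, "", soft)  # drop T/G units
    print $1 ((used + 0 > soft + 0) ? " OVER" : " under") " soft quota"
  }'
}

parse_quota <<'EOF'
Project quota on /export (/dev/mapper/export)
Project ID   Used   Soft   Hard   Warn/Grace
----------   ----   ----   ----   ----------
hausslerlab  1.8T    15T    16T   00 [------]
EOF
# prints: hausslerlab under soft quota
```

Note the unit-stripping is a rough numeric compare that only works when Used and Soft are in the same unit, as in the example.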
== /scratch Space on the Servers ==
Each server generally has a local /scratch filesystem that you can use for temporary files. '''BE ADVISED''' that /scratch is not backed up, and the data there could disappear in the event of a disk failure or other problem. Do not store important data there; if it is important, move it somewhere else very soon after creation.
'''Groups Directories (/private/group/groupname)''' The group storage directories are created per PI, and each group directory has a default 15TB soft quota and 16TB hard quota. For example, if David Haussler is the PI that you report to directly, then the directory would exist as /private/groups/hausslerlab. Request access to that group directory and you will then be able to write to it. Each of those group directories are shared by the lab it belongs to, so you must be wary of everyone's data usage and share the 15TB available per group accordingly. On the compute servers you can check your group's current quota usage by using the '/usr/bin/viewquota' command. You can only check the quota of a group you are part of (you would be a member of the UNIX group of the same name). If you wanted to check the quota usage of /private/groups/hausslerlab for example, you would do: $ viewquota hausslerlab Project quota on /export (/dev/mapper/export) Project ID Used Soft Hard Warn/Grace ---------- --------------------------------- hausslerlab 1.8T 15T 16T 00 [------] '''Soft Versus Hard Quotas''' We use soft and hard quotas for disk space. Once you exceed a directory's soft quota, a one-week countdown timer starts. When that timer runs out, you will no longer be able to create new files or write more data in that directory. You can reset the countdown timer by dropping down to under the soft quota limit. You will not be permitted to exceed a directory's hard quota at all. Any attempt to try will produce an error; the precise error will depend on how your software responds to running out of disk space. When quotas are first applied to a directory, or are reduced, it is possible to end up with more data or files in the directory than the quota allows for. This outcome does not trigger deletion of any existing data, but will prevent creation of new data or files. 
== /scratch Space on the Servers == Each server will generally have a local /scratch filesystem that you can use to store temporary files. '''BE ADVISED''' that /scratch is not backed up, and the data there could disappear in the event of a disk failure or anything else. Do not store important data there. If it is important, it should be moved somewhere else very soon after creation. 81954e6468c8c452de3d5f4e16e798b6dda6608d 360 359 2023-06-14T23:26:18Z Weiler 3 /* Storage */ wikitext text/x-wiki == Storage == Our servers mount two types of ''shared'' storage; home directories and group storage directories. These home directories will mount over the network to all shared compute servers and the phoenix cluster, so any server you login to will have these filesystems available: '''Filesystem Specifications''' {| class="wikitable" style="text-align:center;" |- style="font-weight:bold; text-align:left;" ! Filesystem<br /> ! /private/home ! /private/groups |- | style="font-weight:bold; text-align:left;" | Default Soft Quota | 30 GB | 15 TB |- | style="font-weight:bold; text-align:left;" | Default Hard Quota | 31 GB | 16 TB |- | style="font-weight:bold; text-align:left;" | Total Capacity | 19 TB | 500 TB |- style="text-align:left;" | style="font-weight:bold;" | Access Speed | Moderate (Spinning Disk) | Very Fast (NVMe Flash Media) |- style="text-align:left;" | style="font-weight:bold;" | Intended Use | This space should be used for login scripts, small bits of code or software repos, etc. No large data should be stored here. | This space should be used for large computational/shared data, large software installations and the like. |} '''Home Directories (/private/home/username)''' Your home directory will be located as "/private/home/username" and has a 30GB soft quota and a 31GB hard quota. Your home directory is meant for small scripts and login data, or a git repo. Please do not try to store large data there or computer on large jobs using data in your home directory. 
'''Groups Directories (/private/groups/groupname)''' The group storage directories are created per PI, and each group directory has a default 15TB soft quota and 16TB hard quota. For example, if David Haussler is the PI that you report to directly, then the directory would exist as /private/groups/hausslerlab. Request access to that group directory and you will then be able to write to it. Each of those group directories are shared by the lab it belongs to, so you must be wary of everyone's data usage and share the 15TB available per group accordingly. On the compute servers you can check your group's current quota usage by using the '/usr/bin/viewquota' command. You can only check the quota of a group you are part of (you would be a member of the UNIX group of the same name). If you wanted to check the quota usage of /private/groups/hausslerlab for example, you would do: $ viewquota hausslerlab Project quota on /export (/dev/mapper/export) Project ID Used Soft Hard Warn/Grace ---------- --------------------------------- hausslerlab 1.8T 15T 16T 00 [------] '''Soft Versus Hard Quotas''' We use soft and hard quotas for disk space. Once you exceed a directory's soft quota, a one-week countdown timer starts. When that timer runs out, you will no longer be able to create new files or write more data in that directory. You can reset the countdown timer by dropping down to under the soft quota limit. You will not be permitted to exceed a directory's hard quota at all. Any attempt to try will produce an error; the precise error will depend on how your software responds to running out of disk space. When quotas are first applied to a directory, or are reduced, it is possible to end up with more data or files in the directory than the quota allows for. This outcome does not trigger deletion of any existing data, but will prevent creation of new data or files. 
== /scratch Space on the Servers == Each server will generally have a local /scratch filesystem that you can use to store temporary files. '''BE ADVISED''' that /scratch is not backed up, and the data there could disappear in the event of a disk failure or anything else. Do not store important data there. If it is important, it should be moved somewhere else very soon after creation. e18a634d9c93f067400fdc4b99b278915ef721ff 392 360 2023-07-16T14:56:12Z Weiler 3 wikitext text/x-wiki == Storage == Our servers mount two types of ''shared'' storage; home directories and group storage directories. These home directories will mount over the network to all shared compute servers and the phoenix cluster, so any server you login to will have these filesystems available: '''Filesystem Specifications''' {| class="wikitable" style="text-align:center;" |- style="font-weight:bold; text-align:left;" ! Filesystem<br /> ! /private/home ! /private/groups |- | style="font-weight:bold; text-align:left;" | Default Soft Quota | 30 GB | 15 TB |- | style="font-weight:bold; text-align:left;" | Default Hard Quota | 31 GB | 16 TB |- | style="font-weight:bold; text-align:left;" | Total Capacity | 19 TB | 500 TB |- style="text-align:left;" | style="font-weight:bold;" | Access Speed | Very Fast (NVMe Flash Media) | Very Fast (NVMe Flash Media) |- style="text-align:left;" | style="font-weight:bold;" | Intended Use | This space should be used for login scripts, small bits of code or software repos, etc. No large data should be stored here. | This space should be used for large computational/shared data, large software installations and the like. |} '''Home Directories (/private/home/username)''' Your home directory will be located as "/private/home/username" and has a 30GB soft quota and a 31GB hard quota. Your home directory is meant for small scripts and login data, or a git repo. Please do not try to store large data there or computer on large jobs using data in your home directory. 
'''Groups Directories (/private/groups/groupname)''' The group storage directories are created per PI, and each group directory has a default 15TB soft quota and 16TB hard quota. For example, if David Haussler is the PI that you report to directly, then the directory would exist as /private/groups/hausslerlab. Request access to that group directory and you will then be able to write to it. Each of those group directories are shared by the lab it belongs to, so you must be wary of everyone's data usage and share the 15TB available per group accordingly. On the compute servers you can check your group's current quota usage by using the '/usr/bin/viewquota' command. You can only check the quota of a group you are part of (you would be a member of the UNIX group of the same name). If you wanted to check the quota usage of /private/groups/hausslerlab for example, you would do: $ viewquota hausslerlab Project quota on /export (/dev/mapper/export) Project ID Used Soft Hard Warn/Grace ---------- --------------------------------- hausslerlab 1.8T 15T 16T 00 [------] '''Soft Versus Hard Quotas''' We use soft and hard quotas for disk space. Once you exceed a directory's soft quota, a one-week countdown timer starts. When that timer runs out, you will no longer be able to create new files or write more data in that directory. You can reset the countdown timer by dropping down to under the soft quota limit. You will not be permitted to exceed a directory's hard quota at all. Any attempt to try will produce an error; the precise error will depend on how your software responds to running out of disk space. When quotas are first applied to a directory, or are reduced, it is possible to end up with more data or files in the directory than the quota allows for. This outcome does not trigger deletion of any existing data, but will prevent creation of new data or files. 
== /scratch Space on the Servers == Each server will generally have a local /scratch filesystem that you can use to store temporary files. '''BE ADVISED''' that /scratch is not backed up, and the data there could disappear in the event of a disk failure or anything else. Do not store important data there. If it is important, it should be moved somewhere else very soon after creation. bcbd1619120facf47dfd5123c6a8998741d21afb 393 392 2023-07-16T18:19:36Z Weiler 3 /* /scratch Space on the Servers */ wikitext text/x-wiki == Storage == Our servers mount two types of ''shared'' storage: home directories and group storage directories. These home directories will mount over the network to all shared compute servers and the phoenix cluster, so any server you log in to will have these filesystems available: '''Filesystem Specifications''' {| class="wikitable" style="text-align:center;" |- style="font-weight:bold; text-align:left;" ! Filesystem<br /> ! /private/home ! /private/groups |- | style="font-weight:bold; text-align:left;" | Default Soft Quota | 30 GB | 15 TB |- | style="font-weight:bold; text-align:left;" | Default Hard Quota | 31 GB | 16 TB |- | style="font-weight:bold; text-align:left;" | Total Capacity | 19 TB | 500 TB |- style="text-align:left;" | style="font-weight:bold;" | Access Speed | Very Fast (NVMe Flash Media) | Very Fast (NVMe Flash Media) |- style="text-align:left;" | style="font-weight:bold;" | Intended Use | This space should be used for login scripts, small bits of code or software repos, etc. No large data should be stored here. | This space should be used for large computational/shared data, large software installations and the like. |} '''Home Directories (/private/home/username)''' Your home directory is located at "/private/home/username" and has a 30GB soft quota and a 31GB hard quota. Your home directory is meant for small scripts and login data, or a git repo.
Please do not try to store large data there or run large compute jobs against data in your home directory. '''Groups Directories (/private/groups/groupname)''' The group storage directories are created per PI, and each group directory has a default 15TB soft quota and 16TB hard quota. For example, if David Haussler is the PI that you report to directly, then the directory would exist as /private/groups/hausslerlab. Request access to that group directory and you will then be able to write to it. Each group directory is shared by the lab it belongs to, so be mindful of everyone's data usage and share the 15TB available per group accordingly. On the compute servers you can check your group's current quota usage by using the '/usr/bin/viewquota' command. You can only check the quota of a group you are part of (you would be a member of the UNIX group of the same name). If you wanted to check the quota usage of /private/groups/hausslerlab for example, you would do: $ viewquota hausslerlab Project quota on /export (/dev/mapper/export) Project ID Used Soft Hard Warn/Grace ---------- --------------------------------- hausslerlab 1.8T 15T 16T 00 [------] '''Soft Versus Hard Quotas''' We use soft and hard quotas for disk space. Once you exceed a directory's soft quota, a one-week countdown timer starts. When that timer runs out, you will no longer be able to create new files or write more data in that directory. You can reset the countdown timer by dropping back under the soft quota limit. You will not be permitted to exceed a directory's hard quota at all. Any attempt will produce an error; the precise error will depend on how your software responds to running out of disk space. When quotas are first applied to a directory, or are reduced, it is possible to end up with more data or files in the directory than the quota allows. This outcome does not trigger deletion of any existing data, but will prevent creation of new data or files.
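If viewquota shows your group near or over its soft quota, a quick way to see where the space is going is to size the top-level entries with standard tools (a minimal sketch; the group path is the example one from above, so substitute your own lab's directory):

```shell
# Example group directory from above; substitute your own lab's path.
GROUP_DIR=/private/groups/hausslerlab
# Show the ten largest top-level entries, biggest first, so you know
# what to clean up or move in order to drop back under the soft quota.
du -sh "$GROUP_DIR"/* 2>/dev/null | sort -rh | head -n 10
```

Running this from a login session is cheap; running it over millions of small files can be slow, so prefer it over a full recursive find when you just need the big offenders.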
== /data/scratch Space on the Servers == Each server will generally have a local /data/scratch filesystem that you can use to store temporary files. '''BE ADVISED''' that /data/scratch is not backed up, and the data there could disappear in the event of a disk failure or anything else. Do not store important data there. If it is important, it should be moved somewhere else very soon after creation. d063a4213b104cebadc0e43cdc7478557f5a6978 Access to the Firewalled Compute Servers 0 17 361 339 2023-06-14T23:27:30Z Weiler 3 wikitext text/x-wiki Before you can access the firewalled environment (Prism), you must get VPN access to it, which is detailed here: [[Requirement for users to get GI VPN access]] == Account Expiration == Your UNIX account will have an expiration date associated with it after creation. If your account was created in January, February or March, then your account expiration date will be July 1st of the '''current year'''. If the account was created after March, then your expiration date will be July 1st of the '''following year'''. You will receive notice by email when your account is about to expire. To renew, simply ask the PI that sponsored you (who will be named in the notice) to email '''cluster-admin@soe.ucsc.edu''' requesting that your account be renewed for another year. If your account expires, the account will be suspended and you will no longer be able to log in or view any data you may have in our systems. Any automated scripts (owned by you) that run via cron or other mechanisms will cease to function. a7c1e5214097845153106518450e3e375d824ec9 Firewalled Storage Cost 0 42 363 2023-06-14T23:33:04Z Weiler 3 Created page with "== Account and Storage Cost == Costs for having an active UNIX account and for storage (per TB) are listed in this document, under "Genomics Project Support", specifically "G..."
wikitext text/x-wiki == Account and Storage Cost == Costs for having an active UNIX account and for storage (per TB) are listed in this document, under "Genomics Project Support", specifically "Genomics IT Systems User Support per user" and "Genomics Data Storage per TB": https://planning.ucsc.edu/budget/rates-and-assessments/recharge-rates/docs/2021-22-approved-recharge-rates.pdf As of the writing of this document, it looks like this: {| class="wikitable" |- style="font-weight:bold; text-align:center;" ! Service ! Cost |- | UNIX User Account per Month | style="text-align:center;" | $28.77 |- | OpenStack User Account per Month | style="text-align:center;" | $28.77 |- | TB of Storage per Month | style="text-align:center;" | $14.97 |} a4eeedba218192a00dfcbe1f327c00eb421688ae 364 363 2023-06-14T23:34:24Z Weiler 3 /* Account and Storage Cost */ wikitext text/x-wiki == Account and Storage Cost == Costs for having an active UNIX account and for storage (per TB) are listed in this document, under "Genomics Project Support", specifically "Genomics IT Systems User Support per user" and "Genomics Data Storage per TB": https://planning.ucsc.edu/budget/rates-and-assessments/recharge-rates/docs/2021-22-approved-recharge-rates.pdf As of the writing of this document, it looks like this: {| class="wikitable" |- style="font-weight:bold; text-align:center;" ! Service ! Cost |- | UNIX User Account per Month | style="text-align:center;" | $28.77 |- | OpenStack User Account per Month | style="text-align:center;" | $28.77 |- | TB of Storage per Month | style="text-align:center;" | $14.97 |} The sponsor of each user and owner of each /private/groups/labname area provides a FOAPAL to our finance group to cover the monthly cost of these resources. 
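As a worked example of the rates above (the user count and storage total below are made-up figures for illustration), a lab's monthly charge can be computed directly:

```shell
# Hypothetical lab: 3 UNIX accounts and 20 TB of group storage, at the
# posted rates of $28.77 per account and $14.97 per TB per month.
USERS=3; TB=20
# Work in cents to avoid floating point in shell arithmetic.
TOTAL_CENTS=$(( USERS * 2877 + TB * 1497 ))
printf 'monthly total: $%d.%02d\n' "$(( TOTAL_CENTS / 100 ))" "$(( TOTAL_CENTS % 100 ))"
```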
7dc99182212c31eb7d20e9ec7d89c7ff44758d6e 365 364 2023-06-14T23:35:00Z Weiler 3 /* Account and Storage Cost */ wikitext text/x-wiki da39a3ee5e6b4b0d3255bfef95601890afd80709 Firewalled User Account and Storage Cost 0 43 367 2023-06-14T23:35:27Z Weiler 3 Created page with "== Account and Storage Cost == Costs for having an active UNIX account and for storage (per TB) are listed in this document, under "Genomics Project Support", specifically "G..." wikitext text/x-wiki == Account and Storage Cost == Costs for having an active UNIX account and for storage (per TB) are listed in this document, under "Genomics Project Support", specifically "Genomics IT Systems User Support per user" and "Genomics Data Storage per TB": https://planning.ucsc.edu/budget/rates-and-assessments/recharge-rates/docs/2021-22-approved-recharge-rates.pdf As of this writing, the rates look like this: {| class="wikitable" |- style="font-weight:bold; text-align:center;" ! Service ! Cost |- | UNIX User Account per Month | style="text-align:center;" | $28.77 |- | OpenStack User Account per Month | style="text-align:center;" | $28.77 |- | TB of Storage per Month | style="text-align:center;" | $14.97 |} The sponsor of each user and owner of each /private/groups/labname area provides a FOAPAL to our finance group to cover the monthly cost of these resources. 7dc99182212c31eb7d20e9ec7d89c7ff44758d6e Overview of using Slurm 0 32 370 338 2023-06-15T16:22:50Z Weiler 3 /* Submit a Slurm Batch Job */ wikitext text/x-wiki When using Slurm, you will need to log into the Slurm head node (currently phoenix.prism). Once you have ssh'd in there, you can execute Slurm batch or interactive commands. You might also want to consult the [[Quick Reference Guide]]. == Submit a Slurm Batch Job == To submit a Slurm batch job, you will need to create a directory that you will have read and write access to on all the nodes (which will often be a shared space).
Let's say I have a batch named "experiment-1". I would create that directory in my groups area: % mkdir -p /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1 % cd /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1 Then you will need to create your job submission batch file. It will look something like this; my file is called 'slurm-test.sh': % vim slurm-test.sh Then populate the file as necessary: #!/bin/bash # Job name: #SBATCH --job-name=weiler_test # # Partition - This is the queue it goes in: #SBATCH --partition=main # # Where to send email (optional) #SBATCH --mail-user=weiler@ucsc.edu # # Number of nodes you need per job: #SBATCH --nodes=1 # # Memory needed for the job. Try very hard to make this accurate. DEFAULT = 4gb #SBATCH --mem=4gb # # Number of tasks (one for each CPU desired for use case) (example): #SBATCH --ntasks=1 # # Processors per task: # At least eight times the number of GPUs needed for nVidia RTX A5500 #SBATCH --cpus-per-task=1 # # Number of GPUs; this can be in the format of "--gres=gpu:[1-8]", or "--gres=gpu:A5500:[1-8]" with the type included (optional) #SBATCH --gres=gpu:1 # # Standard output and error log #SBATCH --output=serial_test_%j.log # # Wall clock limit in hrs:min:sec: #SBATCH --time=00:00:30 # ## Command(s) to run (example): pwd; hostname; date echo "Running test script on a single CPU core" sleep 5 echo "Test done!" date Keep the "SBATCH" lines commented; the scheduler will read them anyway. If you don't need a particular option, just don't include it in the file. To submit the batch job: % sbatch slurm-test.sh Submitted batch job 7 The job(s) will then be scheduled. You can see the state of the queue as such: % squeue JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 7 batch weiler_t weiler R 0:07 1 phoenix-01 The job will write its STDOUT and STDERR to the directory you launched it from. Beyond that, it simply runs whatever commands you gave it, even if they produce no STDOUT.
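Before submitting, it can save time to sanity-check the batch file locally with standard shell tools (a small sketch; 'slurm-test.sh' is the example file from above):

```shell
# The example batch file from above; guard with || true so the check
# degrades gracefully if the file is not present.
SCRIPT=slurm-test.sh
# Verify the script parses as valid bash without actually running it.
if bash -n "$SCRIPT" 2>/dev/null; then echo "syntax OK"; fi
# Review the scheduler directives that sbatch will pick up:
grep '^#SBATCH' "$SCRIPT" 2>/dev/null || true
```

This catches quoting mistakes and typo'd directives before a job sits in the queue only to fail on launch.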
== Launching Several Jobs at Once == You can launch many jobs at once using the $SLURM_ARRAY_TASK_ID variable. Add something like the following to your batch submission file: #SBATCH --array=0-31 #SBATCH --output=array_job_%A_task_%a.out #SBATCH --error=array_job_%A_task_%a.err ## Command(s) to run: echo "I am task $SLURM_ARRAY_TASK_ID" == CGROUPS and Resource Management == Our installation of Slurm will utilize Linux CGROUPS, which puts a hard resource cap on jobs. If you define that your job will need 4GB of RAM, and it uses 5GB, it will fail with an OOM exception. Likewise with CPU or GPU resources; if your job ends up using more than you specify, it will fail. The same applies to the "--time" batch file option: your job will fail if it runs longer than the limit you specify there. This is to keep the nodes from crashing from runaway jobs that use more resources than you think they will. So... TEST YOUR JOBS! Find out how many resources a single job needs before you launch 100 of them. == TEST YOUR JOBS! == Let me say that one more time. Test your jobs before launching a bunch of them! If it fails, you don't want it to fail 100 or more times. Testing also gives you a good idea of how much RAM and CPU a job will need so you can better define your batch files. It's critical to get a good idea of how many resources each of your jobs will use and define your job file appropriately. 7891e5fd384ed929f11c5c8a705812179a691b47 Slurm Tips for Toil 0 38 371 329 2023-06-27T14:08:56Z Anovak 4 wikitext text/x-wiki Here are some tips for running Toil workflows on the Phoenix Slurm cluster. Mostly you might want to run WDL workflows, but you can use some of these for other workflows like Cactus. You can also consult [https://github.com/DataBiosphere/toil/blob/master/docs/running/wdl.rst#running-wdl-with-toil the Toil documentation on WDL workflows].
* Install Toil with WDL support with: pip3 install --upgrade toil[wdl] To use a development version of Toil, you can install from source instead: pip3 install git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl] * You will then need to make sure your '''~/.local/bin''' directory is on your PATH. Open up your '''~/.bashrc''' file and add: export PATH=$PATH:$HOME/.local/bin Then make sure to log out and back in again. * For Toil options, you will want '''--batchSystem slurm''' to make it use Slurm and '''--batchLogsDir ./logs''' (or some other location on a shared filesystem) so the Slurm logs do not get lost. * You may be able to speed up your workflow with '''--caching true''', to cache data on nodes to be shared among multiple simultaneous tasks. * If using '''toil-wdl-runner''', you might want to add '''--jobStore ./jobStore''' to make sure the job store is in a defined, shared location so that you can use '''--restart''' later. * If using '''toil-wdl-runner''', you will want to set the '''SINGULARITY_CACHEDIR''' and '''MINIWDL__SINGULARITY__IMAGE_CACHE''' environment variables for your workflow to locations on shared storage, and possibly to the default cache locations in your home directory. Otherwise Toil will set them to node-local directories for each node, and thus re-download images for each workflow run, and for each cluster node. To avoid this, before your run or in your '''~/.bashrc''' you could, for example: export SINGULARITY_CACHEDIR=$HOME/.singularity/cache export MINIWDL__SINGULARITY__IMAGE_CACHE=$HOME/.cache/miniwdl e5a78c161565178ada8ca9efdf97ccde6214ebc9 Using Docker under Slurm 0 44 373 2023-06-27T21:21:56Z Weiler 3 Created page with "Sometimes it is convenient to ask Slurm to run your job in a docker container. This is just fine, however, you will need to fully test your job in a docker container beforeha..." wikitext text/x-wiki Sometimes it is convenient to ask Slurm to run your job in a docker container.
This is just fine; however, you will need to fully test your job in a docker container beforehand (on mustard or emerald, for example) to see how much RAM and CPU it requires, so you can accurately describe in your slurm job submission file how many resources it needs. You can run your container on mustard, then look at 'top' to see how much RAM and CPU it needs. Be aware that you will need to pull your docker image from a registry, like DockerHub or Quay. You should also run your docker container with the '--rm' flag, so the container cleans itself up after running. So your workflow would look something like this: 1: Pull image from DockerHub 2: docker run --rm docker/welcome-to-docker Optionally you can clean up your image as well, but only if you don't have many jobs using that image on the same node. For example, if I wanted to remove the image labelled "weiler/mytools": $ docker image ls REPOSITORY TAG IMAGE ID CREATED SIZE weiler/mytools latest be6777ad00cf 19 hours ago 396MB somedude/tools latest 9b1d1f6fbf6f 3 weeks ago 607MB $ docker image rm be6777ad00cf We also have auto-cleaning scripts running that will delete any containers and images that were created/pulled more than 7 days ago. This includes the cluster nodes and also the phoenix head node itself. If you need a place to have your images/containers remain longer than that, please put them on mustard, emerald, crimson or razzmatazz. af17656b5237b90a6b8f65291e6224882dd5c756 374 373 2023-06-27T21:22:32Z Weiler 3 wikitext text/x-wiki Sometimes it is convenient to ask Slurm to run your job in a docker container. This is just fine; however, you will need to fully test your job in a docker container beforehand (on mustard or emerald, for example) to see how much RAM and CPU it requires, so you can accurately describe in your slurm job submission file how many resources it needs.
You can run your container on mustard, then look at 'top' to see how much RAM and CPU it needs. Be aware that you will need to pull your docker image from a registry, like DockerHub or Quay. You should also run your docker container with the '--rm' flag, so the container cleans itself up after running. So your workflow would look something like this: 1: Pull image from DockerHub 2: docker run --rm docker/welcome-to-docker Optionally you can clean up your image as well, but only if you don't have many jobs using that image on the same node. For example, if I wanted to remove the image labelled "weiler/mytools": $ docker image ls REPOSITORY TAG IMAGE ID CREATED SIZE weiler/mytools latest be6777ad00cf 19 hours ago 396MB somedude/tools latest 9b1d1f6fbf6f 3 weeks ago 607MB $ docker image rm be6777ad00cf We also have auto-cleaning scripts running that will delete any containers and images that were created/pulled more than 7 days ago. This includes the cluster nodes and also the phoenix head node itself. If you need a place to have your images/containers remain longer than that, please put them on mustard, emerald, crimson or razzmatazz. cfe9641f8da6f9e2448e4aaa7a692dc8f6e2f98f 381 374 2023-06-30T16:42:37Z Weiler 3 wikitext text/x-wiki Sometimes it is convenient to ask Slurm to run your job in a docker container. This is just fine; however, you will need to fully test your job in a docker container beforehand (on mustard or emerald, for example) to see how much RAM and CPU it requires, so you can accurately describe in your slurm job submission file how many resources it needs. == Testing == You can run your container on mustard, then look at 'top' to see how much RAM and CPU it needs. Be aware that you will need to pull your docker image from a registry, like DockerHub or Quay. You should also run your docker container with the '--rm' flag, so the container cleans itself up after running.
So your workflow would look something like this: 1: Pull image from DockerHub 2: docker run --rm docker/welcome-to-docker Optionally you can clean up your image as well, but only if you don't have many jobs using that image on the same node. For example, if I wanted to remove the image laballed "weiler/mytools": $ docker image ls REPOSITORY TAG IMAGE ID CREATED SIZE weiler/mytools latest be6777ad00cf 19 hours ago 396MB somedude/tools latest 9b1d1f6fbf6f 3 weeks ago 607MB $ docker image rm be6777ad00cf == Resource Limits == When running docker containers on Slurm, slurm cannot limit the resources that docker uses. Therefore, when you launch a container, you will need to know how much resources (RAM, CPU) it uses beforehand, determined by your testing. Then launch your job with the following --cpus and --memory parameters so docker itslef will limit what it uses: docker run --rm '''--cpus=16 --memory=1024m''' docker/welcome-to-docker The --memory argument is in megabytes (hence the 'm' at the end). So the above example will set a memory limit of 1GB. == Cleaning Scripts == We also have auto-cleaning scripts running that will delete any containers and images that were created/pulled more than 7 days ago. This includes the cluster nodes and also the phoenix head node itself. If you need a place to have your images/containers remain longer than that, please put them on mustard, emerald, crimson or razzmatazz. Also, there are cleaning scripts in place that will destroy any running containers that have been running for over 7 days. We assume that such a container was not launched with '''--rm''' and needs to be cleaned up. 03718700afe704719aafdd58b00f954d386a66f7 382 381 2023-06-30T16:42:56Z Weiler 3 wikitext text/x-wiki __TOC__ Sometimes it is convenient to ask Slurm to run your job in a docker container. 
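Putting this together, a Slurm batch submission that wraps a Docker job might look something like the sketch below. This is only an illustration: the job name is a placeholder, the image is the docker/welcome-to-docker demo used elsewhere on this page, and the CPU and memory numbers must match what your own testing showed.

```shell
#!/bin/bash
#SBATCH --job-name=my-docker-job
#SBATCH --cpus-per-task=2
#SBATCH --mem=1G

# Pull the image from a registry (DockerHub here), then run it with --rm so
# the container cleans itself up when it finishes. The Docker-side limits
# mirror the Slurm request above.
docker pull docker/welcome-to-docker
docker run --rm --cpus=2 --memory=1024m docker/welcome-to-docker
```

You would submit this with <code>sbatch</code> once the numbers reflect your measured usage.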
This is just fine; however, you will need to fully test your job in a docker container beforehand (on mustard or emerald, for example) to see how much RAM and CPU it requires, so that you can accurately describe in your Slurm job submission file how many resources it needs.

== Testing ==

You can run your container on mustard and watch 'top' to see how much RAM and CPU it uses. Be aware that you will need to pull your docker image from a registry, like DockerHub or Quay. You should also run your docker container with the '--rm' flag, so the container cleans itself up after running. Your workflow would look something like this:

# Pull your image from DockerHub.
# Run it with cleanup enabled: <code>docker run --rm docker/welcome-to-docker</code>

Optionally you can clean up your image as well, but only if you don't have many jobs using that image on the same node. For example, to remove the image labelled "weiler/mytools":

 $ docker image ls
 REPOSITORY       TAG     IMAGE ID      CREATED       SIZE
 weiler/mytools   latest  be6777ad00cf  19 hours ago  396MB
 somedude/tools   latest  9b1d1f6fbf6f  3 weeks ago   607MB
 $ docker image rm be6777ad00cf

== Resource Limits ==

When running docker containers under Slurm, Slurm cannot limit the resources that docker uses. Therefore, when you launch a container, you need to know beforehand, from your testing, how many resources (RAM, CPU) it uses. Then launch your job with the --cpus and --memory parameters so docker itself will limit what it uses:

 docker run --rm '''--cpus=16 --memory=1024m''' docker/welcome-to-docker

The --memory argument is in megabytes (hence the 'm' at the end), so the above example sets a memory limit of 1GB.

== Cleaning Scripts ==

We also have auto-cleaning scripts running that will delete any containers and images that were created/pulled more than 7 days ago. This includes the cluster nodes and also the phoenix head node itself. If you need a place where your images/containers can remain longer than that, please put them on mustard, emerald, crimson or razzmatazz. There are also cleaning scripts in place that will destroy any containers that have been running for over 7 days; we assume that such a container was not launched with '''--rm''' and needs to be cleaned up.

=Tutorial: Getting Started with WDL Workflows on Phoenix=

Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.

This tutorial will help you get started writing and running workflows on the Genomics Institute's Phoenix cluster. By the end, you will be able to run workflows on Phoenix with [https://toil.readthedocs.io/en/latest/ Toil], and to write your own workflows in WDL.

==Setup==

Before we begin, you will need a computer to work at, on which you are able to install software, and the ability to connect to other machines over SSH.

===Getting VPN access===

We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives.
So, to get access to the cluster, you need access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.

===Connecting to Phoenix===

Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should *not* run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs. To connect to the head node:

# Connect to the VPN.
# SSH to <code>phoenix.prism</code>. At the command line, run:

 ssh phoenix.prism

If your username on the cluster (say, <code>flastname</code>) is different than your username on your computer (which might be <code>firstname</code>), you might instead have to run:

 ssh flastname@phoenix.prism

The first time you connect, you will see a message like:

 The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established.
 ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI.
 This key is not known by any other names
 Are you sure you want to continue connecting (yes/no/[fingerprint])?

This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is.
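If your cluster username differs from the one on your own computer, you can avoid typing it on every connection with an entry in your local <code>~/.ssh/config</code> file. A minimal sketch, using the placeholder username <code>flastname</code> from above:

```
# Put this in ~/.ssh/config on your own computer (create the file if needed).
# "flastname" is a placeholder; use your actual cluster username.
Host phoenix.prism
    User flastname
```

After saving this, a plain <code>ssh phoenix.prism</code> will log in as <code>flastname</code>.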
===Installing Toil with WDL support===

Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run:

 pip install --upgrade --user toil[wdl]

This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter *will not* look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:

 echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc

After that, **log out and log back in**, to restart bash and pick up the change. To make sure it worked, you can run:

 toil-wdl-runner --help

If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above.

===Configuring Toil for Phoenix===

Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. So, use these commands to make sure that Toil knows where it ought to put its caches:

 echo 'export SINGULARITY_CACHEDIR="${HOME}/.singularity/cache"' >>~/.bashrc
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${HOME}/.cache/miniwdl"' >>~/.bashrc

After that, **log out and log back in again**, to apply the changes.
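After logging back in, you can quickly confirm that both variables made it into your environment. A small sketch of a convenience check (not a required step):

```shell
# Report whether a named environment variable is visible in this shell.
check_cache_var() {
    if [ -n "$(printenv "$1")" ]; then
        echo "$1: set"
    else
        echo "$1: missing"
    fi
}

# Check the two cache locations Toil and MiniWDL will use:
check_cache_var SINGULARITY_CACHEDIR
check_cache_var MINIWDL__SINGULARITY__IMAGE_CACHE
```

If either reports "missing", re-check the <code>echo ... >>~/.bashrc</code> lines above and log out and in again.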
If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day], so it is important not to skip this step.

==Running an existing workflow==

First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.

 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl

===Preparing an input file===

Near the top of the WDL file, there's a section like this:

 workflow hello_caller {
     input {
         File who
     }

This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code>:

 echo "Mridula Resurrección" >names.txt
 echo "Gershom Šarlota" >>names.txt
 echo "Ritchie Ravi" >>names.txt

Then, we need to create an *inputs file*, which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.)
So, make another file next to <code>names.txt</code> that references it by relative path, like this:

 echo '{"hello_caller.who": "./names.txt"}' >inputs.json

Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].

===Testing at small scale on a single machine===

We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:

 srun -c 2 --mem 8G --pty bash -i

This will start a new shell; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command:

 toil-wdl-runner self_test.wdl inputs.json -o local_run

This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print:

 {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]}

The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>.

===Running at larger scale===

Back on the head node, let's prepare a larger run.
Greeting 3 people isn't cool; let's greet one hundred people! Go get this handy list of people and cut it to length:

 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt
 head -n100 1000_names.txt >100_names.txt

And make a new inputs file:

 echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json

Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it.

 mkdir -p logs
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json

This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory.

==Writing your own workflow==

In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz].

===Writing the file===

All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement:

 version 1.0

Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>.
 version 1.0

 workflow FizzBuzz {
 }

Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section.

 version 1.0

 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
 }

Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is *not* optional, and there is no default value, then the user's inputs file *must* specify a value for it in order for the workflow to run.

Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable.

 Array[Int] numbers = range(item_count)

WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0.
 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
 }

Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array.

Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only *once* in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional *expressions* with a <code>then</code> and an <code>else</code>, but conditional *statements* only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases.

 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
     if (one_based % to_fizz == 0) {
         String fizz = "Fizz"
         if (one_based % to_buzz == 0) {
             String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
         }
     }
     if (one_based % to_buzz == 0) {
         String buzz = "Buzz"
     }
     if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
         # Just a normal number.
     }
 }

Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later.
To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access; we only make the call for numbers where we don't produce a noise instead.

 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
     if (one_based % to_fizz == 0) {
         String fizz = "Fizz"
         if (one_based % to_buzz == 0) {
             String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
         }
     }
     if (one_based % to_buzz == 0) {
         String buzz = "Buzz"
     }
     if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
         # Just a normal number.
         call stringify_number {
             input:
                 the_number = one_based
         }
     }
     String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
 }

We can put the code into the workflow now, and set about writing the task.

 version 1.0

 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }

Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>.

 task stringify_number {
 }

We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>.
So let's fill that in in <code>input</code> and <code>output</code> sections.

 task stringify_number {
     input {
         Int the_number
     }
     # ???
     output {
         String the_string # = ???
     }
 }

Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string # = ???
     }
 }

Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
 }

We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.
 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file:

 version 1.0

 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

Now the only thing missing is a workflow-level <code>output</code> section.
Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1, and Toil doesn't actually support not having one yet, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task.

 version 1.0

 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
     output {
         Array[String] fizzbuzz_results = result
     }
 }

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. Now all that remains is to run the workflow!
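If you want to convince yourself the <code>select_first()</code> fallback logic is right before running it, here is the same selection rule written as a plain Bash script. This is only an illustrative analogue of the WDL above; it is not part of the workflow, and not how Toil runs it. Empty strings stand in for WDL's <code>null</code>.

```shell
#!/usr/bin/env bash
# Mirror of the WDL scatter body: for each 1-based number, pick the first
# non-empty of fizzbuzz / fizz / buzz / the plain number, like select_first().
set -e

item_count=15
to_fizz=3
to_buzz=5
fizzbuzz_override=""   # empty means "use the default", like a null WDL input

fizzbuzz_line() {
    local one_based=$1 fizz="" buzz="" fizzbuzz=""
    if (( one_based % to_fizz == 0 )); then
        fizz="Fizz"
        if (( one_based % to_buzz == 0 )); then
            fizzbuzz="${fizzbuzz_override:-FizzBuzz}"
        fi
    fi
    if (( one_based % to_buzz == 0 )); then
        buzz="Buzz"
    fi
    # select_first(): the first non-empty candidate wins.
    local candidate
    for candidate in "$fizzbuzz" "$fizz" "$buzz" "$one_based"; do
        if [ -n "$candidate" ]; then
            echo "$candidate"
            return
        fi
    done
}

for (( i = 1; i <= item_count; i++ )); do
    fizzbuzz_line "$i"
done
```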
As before, make an inputs file to specify the workflow inputs:

 echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json

Then run it on the cluster with Toil:

 mkdir -p logs
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

Or locally:

 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

==Additional WDL resources==

For more information on writing and running WDL workflows, see:

* [https://docs.openwdl.org/en/stable/ The WDL documentation]
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube]
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]
(You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ===Testing at small scale single-machine=== We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --pty bash -i This will start a new shell; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person.
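Those <code>\u00f3</code>-style sequences are just JSON's way of encoding non-ASCII characters; any JSON parser turns them back into the real characters. As a quick sanity check, you can decode one of the output strings yourself (a sketch, assuming <code>python3</code> is available on the node):

```shell
# Run one of the output strings through a JSON parser to decode the
# \u escape sequences back into the characters they stand for.
printf '%s' '["Hello, Mridula Resurrecci\u00f3n!"]' \
  | python3 -c 'import json, sys; print(json.load(sys.stdin)[0])'
# Prints: Hello, Mridula Resurrección!
```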
To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ===Running at larger scale=== Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people! Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. mkdir -p logs toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. ==Writing your own workflow== In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ===Writing the file=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0.
So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is *not* optional, and there is no default value, then the user's inputs file *must* specify a value for it in order for the workflow to run. Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. 
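If a shell analogy helps, <code>range(n)</code> counts from 0 through n-1, much like <code>seq 0 $((n - 1))</code>, which is why the scatter body will add 1 to each value. A Bash sketch of the numbering (for intuition only, not WDL):

```shell
# WDL's range(5) yields 0,1,2,3,4; adding 1 inside the scatter gives
# the one-based numbers that FizzBuzz actually starts from.
for i in $(seq 0 4); do
  echo $(( i + 1 ))
done
# Prints the numbers 1 through 5, one per line.
```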
Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only *once* in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional *expressions* with a <code>then</code> and an <code>else</code>, but conditional *statements* only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. 
To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, at least in the iterations where we called it instead of making a noise. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>.
So let's fill that in with <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.
task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Now the only thing missing is a workflow-level <code>output</code> section.
Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1 and Toil doesn't actually support not having one yet, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. Now all that remains is to run the workflow!
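Before running it, it helps to know what output to expect. The rules the workflow implements are the classic FizzBuzz ones; here is the same logic as a plain Bash sketch for comparison (it builds "FizzBuzz" by concatenation rather than with <code>select_first()</code>, but produces the same strings when <code>fizzbuzz_override</code> is unset):

```shell
# Classic FizzBuzz in Bash, mirroring the workflow's branching:
# multiples of 3 say Fizz, multiples of 5 say Buzz, multiples of
# both say FizzBuzz, and anything else is just the number itself.
for i in $(seq 1 15); do
  out=""
  if [ $(( i % 3 )) -eq 0 ]; then out="Fizz"; fi
  if [ $(( i % 5 )) -eq 0 ]; then out="${out}Buzz"; fi
  if [ -z "$out" ]; then out="$i"; fi
  echo "$out"
done
```

So for an <code>item_count</code> of 20, the workflow's <code>fizzbuzz_results</code> array should begin <code>"1", "2", "Fizz", "4", "Buzz", "Fizz"</code>, with <code>"FizzBuzz"</code> at position 15.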
As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json ==Additional WDL resources== For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL documentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]
==Setup== Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH. ===Getting VPN access=== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ===Connecting to Phoenix=== Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should *not* run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs. To connect to the head node: 1. Connect to the VPN. 2. SSH to <code>phoenix.prism</code>. At the command line, run: ssh phoenix.prism If your username on the cluster (say, <code>flastname</code>) is different than your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@phoenix.prism The first time you connect, you will see a message like: The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established. ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI. This key is not known by any other names Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. 
You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is. ===Installing Toil with WDL support=== Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run: pip install --upgrade --user toil[wdl] This will install Toil in the <code>.local</code> directory inside your home directory, which we write as </code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter *will not* look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run: echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc After that, **log out and log back in**, to restart bash and pick up the change. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above. ===Configuring Toil for Phoenix=== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. 
So, use these commands to make sure that Toil knows where it ought to put its caches: echo 'export SINGULARITY_CACHEDIR="${HOME}/.singularity/cache"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${HOME}/.cache/miniwdl"' >>~/.bashrc After that, **log out and log back in again**, to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day], so it is important not to skip this step. ==Running an existing workflow== First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ===Preparing an input file=== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code> echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an *inputs file*, which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. 
(You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ===Testing at small scale single-machine=== We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --pty bash -i This will start a new shell; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. 
To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ===Running at larger scale=== Back on the head node, let's prepare to run a larger run. Greeting 3 people isn't cool, let's greet one hundred people! Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. mkdir -p logs toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. ==Writing your own workflow== In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ===Writing the file=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files. Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. 
So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is *not* optional, and there is no default value, then the user's inputs file *must* specify a value for it in order for the workflow to run. Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. 
Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only *once* in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional *expressions* with a <code>then</code> and an <code>else</code>, but conditional *statements* only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. 
To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the call with <code>.</code> access; here we only make the call when we aren't producing a noise instead, and, just like the conditional variables, a call's outputs are <code>null</code> if the call didn't run.

 Array[Int] numbers = range(item_count)
 
 scatter (i in numbers) {
     Int one_based = i + 1
 
     if (one_based % to_fizz == 0) {
         String fizz = "Fizz"
         if (one_based % to_buzz == 0) {
             String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
         }
     }
     if (one_based % to_buzz == 0) {
         String buzz = "Buzz"
     }
     if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
         # Just a normal number.
         call stringify_number {
             input:
                 the_number = one_based
         }
     }
     String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
 }

We can put the code into the workflow now, and set about writing the task.

 version 1.0
 
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
 
     Array[Int] numbers = range(item_count)
 
     scatter (i in numbers) {
         Int one_based = i + 1
 
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }

Our task should go after the workflow in the file. It looks a lot like a workflow, except it uses <code>task</code>.

 task stringify_number {
 }

We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>.
So let's fill those in as <code>input</code> and <code>output</code> sections.

 task stringify_number {
     input {
         Int the_number
     }
 
     # ???
 
     output {
         String the_string # = ???
     }
 }

Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.

 task stringify_number {
     input {
         Int the_number
     }
 
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
 
     output {
         String the_string # = ???
     }
 }

Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).

 task stringify_number {
     input {
         Int the_number
     }
 
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
 
     output {
         String the_string = read_string(stdout())
     }
 }

We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.
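Before adding that, here is a miniature of what any runner does with a <code>command</code> section: substitute the inputs into the script text, run it under Bash, and capture standard output, which <code>read_string()</code> then reads with the trailing newline stripped. This is an illustrative Python sketch of those mechanics, not how Toil is actually implemented:

```python
import subprocess

the_number = 7  # stands in for the Int the_number input

# ~{the_number} substitution: the value is spliced into the script text.
script = f"set -e\necho {the_number}"

# The command runs under a shell, with its standard output captured.
completed = subprocess.run(["bash", "-c", script],
                           capture_output=True, text=True, check=True)

# read_string(stdout()) reads that capture, minus the trailing newline.
the_string = completed.stdout.rstrip("\n")
print(the_string)
```

Note that the substitution happens *before* Bash ever sees the script, which is why <code>~{}</code> works even inside single quotes where Bash's own <code>${}</code> would not.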
 task stringify_number {
     input {
         Int the_number
     }
 
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
 
     output {
         String the_string = read_string(stdout())
     }
 
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

The <code>disks</code> entry is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage.

Then we can put our task into our WDL file:

 version 1.0
 
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
 
     Array[Int] numbers = range(item_count)
 
     scatter (i in numbers) {
         Int one_based = i + 1
 
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }
 
 task stringify_number {
     input {
         Int the_number
     }
 
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
 
     output {
         String the_string = read_string(stdout())
     }
 
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

Now the only thing missing is a workflow-level <code>output</code> section.
Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1, and Toil doesn't actually support not having one yet, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task.

 version 1.0
 
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
 
     Array[Int] numbers = range(item_count)
 
     scatter (i in numbers) {
         Int one_based = i + 1
 
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 
     output {
         Array[String] fizzbuzz_results = result
     }
 }
 
 task stringify_number {
     input {
         Int the_number
     }
 
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
 
     output {
         String the_string = read_string(stdout())
     }
 
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. Now all that remains is to run the workflow!
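Before running it, it can help to know what to expect. The workflow's per-item logic matches this plain-Python sanity check (an illustration for comparing against your results, not part of the workflow):

```python
def fizzbuzz(item_count, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    # Mirrors the FizzBuzz workflow: one result string per scattered item.
    results = []
    for i in range(item_count):               # scatter (i in numbers)
        one_based = i + 1
        if one_based % to_fizz == 0 and one_based % to_buzz == 0:
            results.append(fizzbuzz_override or "FizzBuzz")
        elif one_based % to_fizz == 0:
            results.append("Fizz")
        elif one_based % to_buzz == 0:
            results.append("Buzz")
        else:
            # What the stringify_number task echoes back.
            results.append(str(one_based))
    return results

print(fizzbuzz(20))
```

For <code>item_count</code> of 20, entries 3, 6, 9, 12, and 18 should be <code>"Fizz"</code>, entries 5, 10, and 20 should be <code>"Buzz"</code>, and entry 15 should be <code>"FizzBuzz"</code> (or the override, if you set one).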
As before, make an inputs file to specify the workflow inputs:

 echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json

Then run it on the cluster with Toil:

 mkdir -p logs
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

Or locally:

 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

==Additional WDL resources==

For more information on writing and running WDL workflows, see:

* [https://docs.openwdl.org/en/stable/ The WDL documentation]
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube]
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]

4159a195de9e24566f8d0c689894ce460b85ceb0 384 383 2023-07-07T14:35:06Z Anovak 4 wikitext text/x-wiki

=Tutorial: Getting Started with WDL Workflows on Phoenix=

Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like the Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.

This tutorial will help you get started writing and running workflows on the Genomics Institute's Phoenix cluster. By the end, you will know how to run workflows on Phoenix with [https://toil.readthedocs.io/en/latest/ Toil], and how to write your own workflows in WDL.

==Setup==

Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.
===Getting VPN access===

We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.

To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.

===Connecting to Phoenix===

Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should *not* run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs.

To connect to the head node:

1. Connect to the VPN.
2. SSH to <code>phoenix.prism</code>. At the command line, run:

 ssh phoenix.prism

If your username on the cluster (say, <code>flastname</code>) is different than your username on your computer (which might be <code>firstname</code>), you might instead have to run:

 ssh flastname@phoenix.prism

The first time you connect, you will see a message like:

 The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established.
 ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI.
 This key is not known by any other names
 Are you sure you want to continue connecting (yes/no/[fingerprint])?

This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>.
If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is.

===Installing Toil with WDL support===

Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run:

 pip install --upgrade --user toil[wdl]

This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter *will not* look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found.

To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:

 echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc

After that, **log out and log back in**, to restart bash and pick up the change.

To make sure it worked, you can run:

 toil-wdl-runner --help

If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports.

If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above.

===Configuring Toil for Phoenix===

Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps.
So, use these commands to make sure that Toil knows where it ought to put its caches:

 echo 'export SINGULARITY_CACHEDIR="${HOME}/.singularity/cache"' >>~/.bashrc
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${HOME}/.cache/miniwdl"' >>~/.bashrc

After that, **log out and log back in again**, to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day], so it is important not to skip this step.

==Running an existing workflow==

First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].

First, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.

 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl

===Preparing an input file===

Near the top of the WDL file, there's a section like this:

 workflow hello_caller {
     input {
         File who
     }

This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.

So first, we have to make that list of names. Let's make it in <code>names.txt</code>:

 echo "Mridula Resurrección" >names.txt
 echo "Gershom Šarlota" >>names.txt
 echo "Ritchie Ravi" >>names.txt

Then, we need to create an *inputs file*, which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow.
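If your inputs ever get complicated enough that hand-writing JSON becomes error-prone, the <code>json</code> module in Python's standard library will handle the quoting and escaping for you. This sketch writes the same inputs file this tutorial builds with a shell one-liner:

```python
import json

# The key is "<workflow name>.<input name>"; the value here is a file path
# relative to the location of the inputs file.
inputs = {"hello_caller.who": "./names.txt"}

with open("inputs.json", "w") as f:
    json.dump(inputs, f)

print(open("inputs.json").read())
```

Either way of producing the file gives the runner the same thing: a JSON object mapping fully-qualified input names to values.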
(You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this:

 echo '{"hello_caller.who": "./names.txt"}' >inputs.json

Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].

===Testing at small scale on a single machine===

We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:

 srun -c 2 --mem 8G --pty bash -i

This will start a new shell; to leave it and go back to the head node you can use <code>exit</code>.

In your new shell, run this Toil command:

 toil-wdl-runner self_test.wdl inputs.json -o local_run

This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print:

 {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]}

The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person.
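Those <code>\u00f3</code>-style sequences are ordinary JSON string escapes, so any JSON parser recovers the real characters. For example, in Python:

```python
import json

# A fragment of the runner's output JSON, with an ASCII-safe escape in it.
line = '{"hello_caller.messages": ["Hello, Gershom \\u0160arlota!"]}'

# Decoding restores the original character: \u0160 is "Š".
decoded = json.loads(line)
print(decoded["hello_caller.messages"][0])
```

This is also why it is safe to feed the output JSON straight into downstream scripts: the escaping is lossless.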
To leave your interactive Slurm session and return to the head node, use <code>exit</code>.

===Running at larger scale===

Back on the head node, let's prepare for a larger run. Greeting 3 people isn't cool; let's greet one hundred people! Go get this handy list of people and cut it to length:

 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt
 head -n100 1000_names.txt >100_names.txt

And make a new inputs file:

 echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json

Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create, where it can store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it.

 mkdir -p logs
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json

This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory.

==Writing your own workflow==

In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz].

===Writing the file===

All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0.
So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is *not* optional, and there is no default value, then the user's inputs file *must* specify a value for it in order for the workflow to run. Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. 
Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only *once* in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional *expressions* with a <code>then</code> and an <code>else</code>, but conditional *statements* only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. 
To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with </code>.</code> access, only if we don't make a noise instead. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string] } } Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. 
So let's fill that in in <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optiopnal inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them. 
task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string] } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Now the only thing missing is a workflow-level <code>output</code> section. 
Technically, in WDL 1.0 you aren's supposed to need this, but you do need it in 1.1 and Toil doesn't actually support not having one yet, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string] } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. Now all that remains is to run the workflow! 
As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json ==Additional WDL resources== For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL dcoumentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification] a0c34f24329c930ee6b74c4e18f3b1a8c9699d10 385 384 2023-07-07T14:36:19Z Anovak 4 wikitext text/x-wiki =Tutorial: Getting Started with WDL Workflows on Phoenix= Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows on the Genomics Institute's Phoenix cluster. By the end, you will be able to run workflows on Phoenix with [https://toil.readthedocs.io/en/latest/ Toil], and how to write your own workflows in WDL. ==Setup== Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH. 
===Getting VPN access=== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ===Connecting to Phoenix=== Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should *not* run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs. To connect to the head node: 1. Connect to the VPN. 2. SSH to <code>phoenix.prism</code>. At the command line, run: ssh phoenix.prism If your username on the cluster (say, <code>flastname</code>) is different than your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@phoenix.prism The first time you connect, you will see a message like: The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established. ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI. This key is not known by any other names Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. 
If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is. ===Installing Toil with WDL support=== Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run: pip install --upgrade --user toil[wdl] This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter *will not* look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run: echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc After that, **log out and log back in**, to restart bash and pick up the change. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above. ===Configuring Toil for Phoenix=== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps.
So, use these commands to make sure that Toil knows where it ought to put its caches: echo 'export SINGULARITY_CACHEDIR="${HOME}/.singularity/cache"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${HOME}/.cache/miniwdl"' >>~/.bashrc After that, **log out and log back in again**, to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day], so it is important not to skip this step. ==Running an existing workflow== First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. To start, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ===Preparing an input file=== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code>: echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an *inputs file*, which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow.
(You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ===Testing at small scale on a single machine=== We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --pty bash -i This will start a new shell; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person.
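As an aside, because the inputs file is plain JSON, you can also generate it with a short script instead of <code>echo</code> when a workflow has many inputs. A minimal Python sketch (the filenames are just this tutorial's examples):

```python
import json

# Inputs-file keys are "<workflow name>.<input name>"; File inputs are
# given as paths relative to the inputs file (absolute paths and URLs
# also work).
inputs = {"hello_caller.who": "./names.txt"}

with open("inputs.json", "w") as out:
    json.dump(inputs, out, indent=2)
```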
To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ===Running at larger scale=== Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people! Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. mkdir -p logs toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. ==Writing your own workflow== In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ===Writing the file=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0.
So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is *not* optional, and there is no default value, then the user's inputs file *must* specify a value for it in order for the workflow to run. Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. 
Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only *once* in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional *expressions* with a <code>then</code> and an <code>else</code>, but conditional *statements* only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. 
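The <code>select_first()</code> behavior we are relying on can be pictured with a small Python analogy (illustrative only, not WDL):

```python
def select_first(values):
    """Return the first non-None entry, like WDL's select_first()
    applied to an array whose entries may be null."""
    for value in values:
        if value is not None:
            return value
    raise ValueError("all values were null")

# Variables declared inside WDL conditionals that did not run read as
# null (None here); for the number 3, only "fizz" was set.
fizzbuzz, fizz, buzz = None, "Fizz", None
result = select_first([fizzbuzz, fizz, buzz, "3"])  # → "Fizz"
```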
To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, at least for the numbers where we didn't make a noise instead. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>.
So let's fill those in with <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.
task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Now the only thing missing is a workflow-level <code>output</code> section.
Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1 and Toil doesn't actually support not having one yet, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. Now all that remains is to run the workflow!
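Before running it, it can help to know what output to expect. The workflow's logic corresponds to this Python sketch (an analogy of the scatter and conditionals above, not how Toil actually executes WDL):

```python
def fizzbuzz(item_count, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    """Mirror the WDL scatter: one result string per number."""
    results = []
    for i in range(item_count):
        one_based = i + 1
        fizz = "Fizz" if one_based % to_fizz == 0 else None
        buzz = "Buzz" if one_based % to_buzz == 0 else None
        fizzbuzz = None
        if fizz is not None and buzz is not None:
            # select_first([fizzbuzz_override, "FizzBuzz"])
            fizzbuzz = fizzbuzz_override if fizzbuzz_override is not None else "FizzBuzz"
        # select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
        results.append(next(v for v in (fizzbuzz, fizz, buzz, str(one_based))
                            if v is not None))
    return results

# The first few entries are "1", "2", "Fizz", "4", "Buzz", ...
print(fizzbuzz(20))
```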
As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json ==Additional WDL resources== For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL documentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]
To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with </code>.</code> access, only if we don't make a noise instead. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string] } } Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. 
So let's fill that in in <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optiopnal inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them. 
task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string] } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Now the only thing missing is a workflow-level <code>output</code> section. 
Technically, in WDL 1.0 you aren's supposed to need this, but you do need it in 1.1 and Toil doesn't actually support not having one yet, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string] } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. Now all that remains is to run the workflow! 
=Tutorial: Getting Started with WDL Workflows on Phoenix= Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you.
Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows on the Genomics Institute's Phoenix cluster. By the end, you will be able to run workflows on Phoenix with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong. ==Setup== Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH. ===Getting VPN access=== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ===Connecting to Phoenix=== Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should *not* run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs. To connect to the head node: 1. Connect to the VPN. 2. SSH to <code>phoenix.prism</code>. 
At the command line, run: ssh phoenix.prism If your username on the cluster (say, <code>flastname</code>) is different from your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@phoenix.prism The first time you connect, you will see a message like: The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established. ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI. This key is not known by any other names Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is. ===Installing Toil with WDL support=== Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run: pip install --upgrade --user toil[wdl] This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter *will not* look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed.
To do this, run: echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc After that, **log out and log back in**, to restart bash and pick up the change. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above. ===Configuring Toil for Phoenix=== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. So, use these commands to make sure that Toil knows where it ought to put its caches: echo 'export SINGULARITY_CACHEDIR="${HOME}/.singularity/cache"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${HOME}/.cache/miniwdl"' >>~/.bashrc After that, **log out and log back in again**, to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day], so it is important not to skip this step. ==Running an existing workflow== First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. 
wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ===Preparing an input file=== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code>: echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an *inputs file*, which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ===Testing at small scale on a single machine=== We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --pty bash -i This will start a new shell; to leave it and go back to the head node you can use <code>exit</code>.
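(As an aside, the inputs-file convention shown above is easy to generate programmatically. Here is a minimal, purely illustrative Python sketch; the "workflow name, dot, input name" key format is the real WDL convention, but the script itself is just for demonstration:)

```python
import json

# The key is the workflow name, a dot, and the input name;
# the value is a path relative to the inputs file's location.
workflow_name = "hello_caller"
inputs = {workflow_name + ".who": "./names.txt"}

inputs_json = json.dumps(inputs)
print(inputs_json)  # {"hello_caller.who": "./names.txt"}
```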
In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ===Running at larger scale=== Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people! Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it.
mkdir -p logs toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. ==Writing your own workflow== In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ===Writing the file=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is *not* optional, and there is no default value, then the user's inputs file *must* specify a value for it in order for the workflow to run. Now we'll start on the body of the workflow, to be inserted just after the inputs section.
The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only *once* in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional *expressions* with a <code>then</code> and an <code>else</code>, but conditional *statements* only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. 
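(As a purely illustrative Python analogy for the pattern just described — a list comprehension stands in for the scatter, <code>None</code> stands in for WDL's <code>null</code> from un-executed conditionals, and a plain <code>str()</code> call stands in for the task we will write shortly:)

```python
def select_first(values):
    # Stand-in for WDL's select_first(): return the first non-null value
    for v in values:
        if v is not None:
            return v
    raise ValueError("all values were null")

def fizzbuzz_word(one_based, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    # Each conditional either declares its variable or leaves it "null"
    fizz = "Fizz" if one_based % to_fizz == 0 else None
    buzz = "Buzz" if one_based % to_buzz == 0 else None
    fizzbuzz = None
    if fizz is not None and buzz is not None:
        fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
    # Stand-in for calling a task to stringify the plain numbers
    stringified = str(one_based) if fizz is None and buzz is None else None
    return select_first([fizzbuzz, fizz, buzz, stringified])

# The scatter over range(item_count), with the one-based increment
item_count = 15
results = [fizzbuzz_word(i + 1) for i in range(item_count)]
print(results[2], results[4], results[14])  # Fizz Buzz FizzBuzz
```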
Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, but only if we didn't make a noise instead. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill those in as <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> returns a WDL <code>File</code> containing the standard output printed by the task's command.
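(A rough, purely illustrative Python analogue of this capture step: run the echo command, take its standard output, and strip the trailing newline that echo appends:)

```python
import subprocess

# Illustrative Python analogue of the task body: echo the number,
# then capture the command's standard output as a string.
the_number = 42
completed = subprocess.run(
    ["echo", str(the_number)],
    capture_output=True, text=True, check=True,
)
# stdout ends with a newline; strip it when reading it back as a string
the_string = completed.stdout.rstrip("\n")
print(the_string)  # 42
```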
We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
 fizzbuzz_override
   }
   Array[Int] numbers = range(item_count)
   scatter (i in numbers) {
     Int one_based = i + 1
     if (one_based % to_fizz == 0) {
       String fizz = "Fizz"
       if (one_based % to_buzz == 0) {
         String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
       }
     }
     if (one_based % to_buzz == 0) {
       String buzz = "Buzz"
     }
     if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
       # Just a normal number.
       call stringify_number {
         input:
           the_number = one_based
       }
     }
     String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
   }
 }
 
 task stringify_number {
   input {
     Int the_number
   }
   command <<<
     # This is a Bash script.
     # So we should do good Bash script things like stop on errors
     set -e
     # Now print our number as a string
     echo ~{the_number}
   >>>
   output {
     String the_string = read_string(stdout())
   }
   runtime {
     cpu: 1
     memory: "0.5 GB"
     disks: "local-disk 1 SSD"
     docker: "ubuntu:22.04"
   }
 }

Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1, and Toil doesn't actually support leaving it out yet, so we're going to make one. We need to collect all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task.

 version 1.0
 
 workflow FizzBuzz {
   input {
     # How many FizzBuzz numbers do we want to make?
     Int item_count
     # Every multiple of this number, we produce "Fizz"
     Int to_fizz = 3
     # Every multiple of this number, we produce "Buzz"
     Int to_buzz = 5
     # Optional replacement for the string to print when a multiple of both
     String?
 fizzbuzz_override
   }
   Array[Int] numbers = range(item_count)
   scatter (i in numbers) {
     Int one_based = i + 1
     if (one_based % to_fizz == 0) {
       String fizz = "Fizz"
       if (one_based % to_buzz == 0) {
         String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
       }
     }
     if (one_based % to_buzz == 0) {
       String buzz = "Buzz"
     }
     if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
       # Just a normal number.
       call stringify_number {
         input:
           the_number = one_based
       }
     }
     String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
   }
   output {
     Array[String] fizzbuzz_results = result
   }
 }
 
 task stringify_number {
   input {
     Int the_number
   }
   command <<<
     # This is a Bash script.
     # So we should do good Bash script things like stop on errors
     set -e
     # Now print our number as a string
     echo ~{the_number}
   >>>
   output {
     String the_string = read_string(stdout())
   }
   runtime {
     cpu: 1
     memory: "0.5 GB"
     disks: "local-disk 1 SSD"
     docker: "ubuntu:22.04"
   }
 }

Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as an array.

Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:

 echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json

Then run it on the cluster with Toil:

 mkdir -p logs
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

Or locally:

 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

==Debugging Workflows==

===Frequently Asked Questions===

====I am getting warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code>====

This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing.
This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it. ====Toil said it was <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code> but I can't find that file!==== The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. ==Additional WDL resources== For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL dcoumentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification] 09eb179e6e6ab4cde56b7e1717c2130060f7eca3 388 387 2023-07-11T16:54:10Z Anovak 4 /* Installing Toil with WDL support */ wikitext text/x-wiki =Tutorial: Getting Started with WDL Workflows on Phoenix= Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows on the Genomics Institute's Phoenix cluster. 
By the end, you will be able to run workflows on Phoenix with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong. ==Setup== Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH. ===Getting VPN access=== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ===Connecting to Phoenix=== Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should *not* run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs. To connect to the head node: 1. Connect to the VPN. 2. SSH to <code>phoenix.prism</code>. At the command line, run: ssh phoenix.prism If your username on the cluster (say, <code>flastname</code>) is different than your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@phoenix.prism The first time you connect, you will see a message like: The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established. ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI. This key is not known by any other names Are you sure you want to continue connecting (yes/no/[fingerprint])? 
This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is. ===Installing Toil with WDL support=== Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run: pip install --upgrade --user toil[wdl] This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter *will not* look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run: echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc After that, **log out and log back in**, to restart bash and pick up the change. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above. ===Configuring Toil for Phoenix=== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. 
However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. So, use these commands to make sure that Toil knows where it ought to put its caches: echo 'export SINGULARITY_CACHEDIR="${HOME}/.singularity/cache"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${HOME}/.cache/miniwdl"' >>~/.bashrc After that, **log out and log back in again**, to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day], so it is important not to skip this step. ==Running an existing workflow== First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ===Preparing an input file=== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. 
Let's make it in <code>names.txt</code> echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an *inputs file*, which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ===Testing at small scale single-machine=== We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --pty bash -i This will start a new shell; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. 
This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ===Running at larger scale=== Back on the head node, let's prepare to run a larger run. Greeting 3 people isn't cool, let's greet one hundred people! Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. mkdir -p logs toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. 
==Writing your own workflow== In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ===Writing the file=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files. Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is *not* optional, and there is no default value, then the user's inputs file *must* specify a value for it in order for the workflow to run. Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. 
Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only *once* in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional *expressions* with a <code>then</code> and an <code>else</code>, but conditional *statements* only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. 
Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with </code>.</code> access, only if we don't make a noise instead. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? 
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string] } } Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill that in in <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> returns a WDL <code>File</code> containing the standard output printed by the task's command. 
We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optiopnal inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? 
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string] } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren's supposed to need this, but you do need it in 1.1 and Toil doesn't actually support not having one yet, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? 
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string] } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json ==Debugging Workflows== ===Frequently Asked Questions=== ====I am getting warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code>==== This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. 
This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it. ====Toil said it was <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code> but I can't find that file!==== The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. ==Additional WDL resources== For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL dcoumentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification] e43765b47dcf619b67858999eeffc3715d252ce1 394 388 2023-07-17T15:06:56Z Anovak 4 /* Frequently Asked Questions */ wikitext text/x-wiki =Tutorial: Getting Started with WDL Workflows on Phoenix= Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows on the Genomics Institute's Phoenix cluster. 
By the end, you will be able to run workflows on Phoenix with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong. ==Setup== Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH. ===Getting VPN access=== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ===Connecting to Phoenix=== Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should *not* run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs. To connect to the head node: 1. Connect to the VPN. 2. SSH to <code>phoenix.prism</code>. At the command line, run: ssh phoenix.prism If your username on the cluster (say, <code>flastname</code>) is different than your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@phoenix.prism The first time you connect, you will see a message like: The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established. ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI. This key is not known by any other names Are you sure you want to continue connecting (yes/no/[fingerprint])? 
This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is. ===Installing Toil with WDL support=== Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run: pip install --upgrade --user toil[wdl] This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter *will not* look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run: echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc After that, **log out and log back in**, to restart bash and pick up the change. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above. ===Configuring Toil for Phoenix=== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. 
However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. So, use these commands to make sure that Toil knows where it ought to put its caches: echo 'export SINGULARITY_CACHEDIR="${HOME}/.singularity/cache"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${HOME}/.cache/miniwdl"' >>~/.bashrc After that, **log out and log back in again**, to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day], so it is important not to skip this step. ==Running an existing workflow== First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ===Preparing an input file=== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. 
Let's make it in <code>names.txt</code>: echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an *inputs file*, which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ===Testing at small scale on a single machine=== We are now ready to run the workflow! But you don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --pty bash -i This will start a new shell; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>.
This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ===Running at larger scale=== Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people! Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory, which it will create, where it can store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. mkdir -p logs toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory.
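A quick way to sanity-check a big run like this is to load the file that <code>-m</code> wrote and count the values under each workflow output. This is a hypothetical helper, not part of Toil; it only assumes the output JSON has the shape shown earlier, with keys like <code>hello_caller.messages</code> mapping to arrays:

```python
import json

def count_outputs(outputs_json_path):
    """Count how many values each workflow output in a toil-wdl-runner -m file holds."""
    with open(outputs_json_path) as f:
        outputs = json.load(f)
    # Keys look like "<workflow name>.<output name>"; scatter outputs are JSON arrays.
    return {key: len(value) if isinstance(value, list) else 1
            for key, value in outputs.items()}

# After the Slurm run above finishes, count_outputs("slurm_run.json")
# should report 100 entries for each of the two hello_caller outputs.
```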
==Writing your own workflow== In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ===Writing the file=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is *not* optional, and there is no default value, then the user's inputs file *must* specify a value for it in order for the workflow to run. Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable.
Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only *once* in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional *expressions* with a <code>then</code> and an <code>else</code>, but conditional *statements* only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. 
Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, at least for the numbers where the call actually ran; for the others, <code>stringify_number.the_string</code> will be <code>null</code>. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill that in in <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command.
We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We'll also tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1 and Toil doesn't actually support not having one yet, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. Now all that remains is to run the workflow!
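The nested conditionals and <code>select_first()</code> calls can be hard to trace, so here is a plain-Python sketch of the same logic (not part of the WDL file; just an analogy for predicting what the workflow should output). The <code>next(...)</code> call plays the role of <code>select_first()</code>, picking the first non-<code>None</code> value, and the loop body mirrors one scatter iteration:

```python
def fizzbuzz(item_count, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    """Mirror the WDL workflow: 'scatter' over range(item_count), one string per number."""
    results = []
    for i in range(item_count):
        one_based = i + 1
        # Variables from un-executed conditionals are None, like null in WDL.
        fizz = "Fizz" if one_based % to_fizz == 0 else None
        buzz = "Buzz" if one_based % to_buzz == 0 else None
        fizzbuzz_val = None
        if fizz is not None and buzz is not None:
            # select_first([fizzbuzz_override, "FizzBuzz"]) in the WDL
            fizzbuzz_val = fizzbuzz_override if fizzbuzz_override is not None else "FizzBuzz"
        # select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
        results.append(next(s for s in [fizzbuzz_val, fizz, buzz, str(one_based)]
                            if s is not None))
    return results

print(fizzbuzz(15))
# ['1', '2', 'Fizz', '4', 'Buzz', 'Fizz', '7', '8', 'Fizz', 'Buzz', '11', 'Fizz', '13', '14', 'FizzBuzz']
```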
As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json ==Debugging Workflows== ===Frequently Asked Questions=== ====I am getting warnings about <code>XDG_RUNTIME_DIR</code>==== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it. ====Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!==== Toil will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. 
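When a Slurm job fails, its log ends up under the directory you passed to <code>--batchLogsDir</code>, and the newest file there is usually the one you want. This is a hypothetical convenience helper, assuming you used <code>./logs</code> as in the commands above; it is ordinary filesystem code, not a Toil feature:

```python
import glob
import os

def newest_batch_log(logs_dir="./logs"):
    """Return the most recently modified file under the --batchLogsDir directory, or None."""
    candidates = glob.glob(os.path.join(logs_dir, "**", "*"), recursive=True)
    files = [p for p in candidates if os.path.isfile(p)]
    if not files:
        return None
    # Sort by modification time so the log from the latest-finishing job wins.
    return max(files, key=os.path.getmtime)
```

You could run this on the head node after a failed run and then open the returned path in a pager.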
==Additional WDL resources== For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL documentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ===Connecting to Phoenix=== Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should *not* run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs. To connect to the head node: 1. Connect to the VPN. 2. SSH to <code>phoenix.prism</code>. At the command line, run: ssh phoenix.prism If your username on the cluster (say, <code>flastname</code>) is different than your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@phoenix.prism The first time you connect, you will see a message like: The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established. ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI. This key is not known by any other names Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is. ===Installing Toil with WDL support=== Once you are on the head node, you can install Toil, a program for running workflows. 
When installing, you need to specify that you want WDL support. To do this, you can run: pip install --upgrade --user toil[wdl] This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter *will not* look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run: echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc After that, **log out and log back in**, to restart bash and pick up the change. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above. ===Configuring Toil for Phoenix=== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. So, use these commands to make sure that Toil knows where it ought to put its caches: echo 'export SINGULARITY_CACHEDIR="${HOME}/.singularity/cache"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${HOME}/.cache/miniwdl"' >>~/.bashrc After that, **log out and log back in again**, to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. 
That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day], so it is important not to skip this step. ==Running an existing workflow== First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ===Preparing an input file=== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code> echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an *inputs file*, which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. 
For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ===Testing at small scale single-machine=== We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --pty bash -i This will start a new shell; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ===Running at larger scale=== Back on the head node, let's prepare to run a larger run. Greeting 3 people isn't cool, let's greet one hundred people! 
Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. mkdir -p logs toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. ==Writing your own workflow== In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ===Writing the file=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files. Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. 
version 1.0 workflow FizzBuzz { } Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is *not* optional, and there is no default value, then the user's inputs file *must* specify a value for it in order for the workflow to run. Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. 
Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only *once* in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional *expressions* with a <code>then</code> and an <code>else</code>, but conditional *statements* only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. 
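To make the <code>select_first()</code> trick concrete, here is a Python sketch of one scatter iteration (illustrative only; this is not how Toil executes WDL). Each branch variable starts as <code>None</code>, standing in for a WDL variable in an un-executed conditional, and <code>select_first()</code> picks the first non-null value. The plain-number case is stood in for by <code>str()</code>, where the WDL will call a task instead.

```python
def select_first(values):
    # Like WDL's select_first(): return the first non-null value.
    for v in values:
        if v is not None:
            return v
    raise ValueError("no non-null value")

def fizzbuzz_label(one_based, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    # Branch variables are None unless their conditional "executes".
    fizz = buzz = fizzbuzz = None
    if one_based % to_fizz == 0:
        fizz = "Fizz"
        if one_based % to_buzz == 0:
            fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
    if one_based % to_buzz == 0:
        buzz = "Buzz"
    # The plain-number case; in the WDL this comes from the
    # stringify_number task call instead of str().
    return select_first([fizzbuzz, fizz, buzz, str(one_based)])
```

For example, 15 resolves to "FizzBuzz", 9 to "Fizz", 10 to "Buzz", and 7 to "7".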
To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the call with <code>.</code> access, but only in iterations where we actually made the call instead of producing a noise. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>.
So let's fill those in with <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.
task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Now the only thing missing is a workflow-level <code>output</code> section.
Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1 and Toil doesn't actually support not having one yet, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. Now all that remains is to run the workflow!
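The gathering behavior just described can be sketched in Python (an analogy only, not Toil's implementation): a scatter acts like a comprehension, and each variable declared inside it is seen outside it as an array of per-iteration values.

```python
# `numbers` plays the role of range(item_count), here with item_count = 4.
numbers = list(range(4))
# Inside the scatter, one_based is a single Int per iteration...
one_based = [i + 1 for i in numbers]
# ...but outside the scatter, it (and `result`) are gathered arrays,
# which is why the output section can assign `result` to Array[String].
result = [str(n) for n in one_based]
```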
As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json ==Debugging Workflows== Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong. ===Debugging Options=== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are somewhere you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs are reproduced like this. ===Reading the Log=== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen when either the command line command is written wrong or the error detection code in the tool you are trying to run detects and reports an error.
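To see what a "failing (i.e. nonzero) exit code" looks like from the outside, here is a small Python demonstration, using Python subprocesses in place of a WDL task's Bash command:

```python
import subprocess
import sys

# A command that runs cleanly exits with status 0.
ok = subprocess.run([sys.executable, "-c", "pass"]).returncode

# A command that reports an error exits nonzero; this is the status
# the WDL runner sees when it reports a CommandFailed error.
failed = subprocess.run(
    [sys.executable, "-c", "import sys; sys.exit(1)"]
).returncode
```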
Go up higher in the log until you find lines that look like: [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt: And [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt: These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. ===Reproducing Problems=== When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. In the log of your failing Toil task, look for lines like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk.
So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ===More Ways of Finding Files=== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try to find the files by name. For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found.
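Rather than pasting the URI into a web URL-decoder, you can also decode a <code>toilfile:</code> URI with a few lines of Python (a convenience sketch; the URI below is the example from the log line above):

```python
from urllib.parse import unquote

def jobstore_relative_path(uri):
    # URL-decode the toilfile: URI and keep the part after the last
    # colon, which is the path relative to the job store directory.
    return unquote(uri).rsplit(":", 1)[-1]

uri = ("toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob"
       "%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351"
       "%2FSample.chr14.bam/Sample.chr14.bam")
relative_path = jobstore_relative_path(uri)
```

Joining <code>relative_path</code> onto your <code>--jobStore</code> directory gives the on-disk location of the file.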
===Frequently Asked Questions=== ====I am getting warnings about <code>XDG_RUNTIME_DIR</code>==== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it. ====Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!==== Toil will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. ==Additional WDL resources== For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL documentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification] =Tutorial: Getting Started with WDL Workflows on Phoenix= Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL).
Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows on the Genomics Institute's Phoenix cluster. By the end, you will be able to run workflows on Phoenix with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong. ==Setup== Before we begin, you will need a computer you can install software on and the ability to connect to other machines over SSH. ===Getting VPN access=== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ===Connecting to Phoenix=== Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should *not* run actual work on it; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs. To connect to the head node: 1. Connect to the VPN. 2. SSH to <code>phoenix.prism</code>.
At the command line, run: ssh phoenix.prism If your username on the cluster (say, <code>flastname</code>) is different than your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@phoenix.prism The first time you connect, you will see a message like: The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established. ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI. This key is not known by any other names Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is. ===Installing Toil with WDL support=== Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run: pip install --upgrade --user toil[wdl] This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter *will not* look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. 
To do this, run: echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc After that, **log out and log back in**, to restart bash and pick up the change. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above. ===Configuring Toil for Phoenix=== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. So, use these commands to make sure that Toil knows where it ought to put its caches: echo 'export SINGULARITY_CACHEDIR="${HOME}/.singularity/cache"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${HOME}/.cache/miniwdl"' >>~/.bashrc After that, **log out and log back in again**, to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day], so it is important not to skip this step. ==Running an existing workflow== First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. 
wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ===Preparing an input file=== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code>: echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an *inputs file*, which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ===Testing at small scale on a single machine=== We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --pty bash -i This will start a new shell; to leave it and go back to the head node you can use <code>exit</code>.
In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ===Running at larger scale=== Back on the head node, let's prepare to run a larger run. Greeting 3 people isn't cool, let's greet one hundred people! Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. 
mkdir -p logs toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. ==Writing your own workflow== In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ===Writing the file=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files. Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is *not* optional, and there is no default value, then the user's inputs file *must* specify a value for it in order for the workflow to run. Now we'll start on the body of the workflow, to be inserted just after the inputs section. 
The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only *once* in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional *expressions* with a <code>then</code> and an <code>else</code>, but conditional *statements* only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. 
Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with </code>.</code> access, only if we don't make a noise instead. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? 
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill those in using <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> returns a WDL <code>File</code> containing the standard output printed by the task's command.
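To make the substitution and capture concrete: with <code>the_number = 7</code>, the rendered script is just <code>echo 7</code>, and its captured standard output is what the task will expose. A rough Python analogy (the <code>render()</code> helper is hypothetical, for illustration only; it is not Toil's actual templating code):

```python
import re
import subprocess

def render(command_template, bindings):
    # Substitute each ~{name} placeholder with its value, stringified.
    # (Hypothetical helper for illustration; not Toil's implementation.)
    return re.sub(r"~\{(\w+)\}", lambda m: str(bindings[m.group(1)]), command_template)

script = render("echo ~{the_number}", {"the_number": 7})
print(script)  # echo 7

# Run the rendered script with Bash and capture what it prints,
# which is what the WDL stdout() File will contain.
result = subprocess.run(["bash", "-c", script],
                        capture_output=True, text=True, check=True)
print(repr(result.stdout))  # '7\n' (note the trailing newline)
```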
We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1 and Toil doesn't actually support not having one yet, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json ==Debugging Workflows== Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong. ===Debugging Options=== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place you can access them.
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs are reproduced like this. ===Reading the Log=== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which happens either when the command is written wrong, or when the tool you are trying to run detects and reports an error. Go up higher in the log until you find lines that look like: [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt: And [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt: These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. ===Reproducing Problems=== When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this.
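As an aside, if you have saved the main Toil log to a file, you can pull those "Standard error at"/"Standard output at" paths out programmatically rather than scrolling. A small sketch, assuming log lines in the format shown above:

```python
import re

# A couple of log lines in the format quoted above (normally read from a saved log file).
log_text = """\
[2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt:
[2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt:
"""

# Collect the paths of the captured task logs mentioned in the main Toil log.
paths = re.findall(r"Standard (?:error|output) at (\S+):", log_text)
print(paths)
```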
In the log of your failing Toil task, look for lines like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk. So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ===More Ways of Finding Files=== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name.
For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ===Frequently Asked Questions=== ====I am getting warnings about <code>XDG_RUNTIME_DIR</code>==== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it. 
====Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!==== Toil will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. ==Additional WDL resources== For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL documentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification] =Tutorial: Getting Started with WDL Workflows on Phoenix= Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows on the Genomics Institute's Phoenix cluster.
By the end, you will be able to run workflows on Phoenix with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong. ==Setup== Before we begin, you will need a computer to work at that you can install software on, and the ability to connect to other machines over SSH. ===Getting VPN access=== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ===Connecting to Phoenix=== Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should *not* run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs. To connect to the head node: 1. Connect to the VPN. 2. SSH to <code>phoenix.prism</code>. At the command line, run: ssh phoenix.prism If your username on the cluster (say, <code>flastname</code>) is different from your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@phoenix.prism The first time you connect, you will see a message like: The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established. ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI. This key is not known by any other names Are you sure you want to continue connecting (yes/no/[fingerprint])?
This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is. ===Installing Toil with WDL support=== Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run: pip install --upgrade --user 'toil[wdl]' If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively: pip install --upgrade --user 'toil[wdl,aws,google]' This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter *will not* look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run: echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc After that, **log out and log back in**, to restart bash and pick up the change. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. 
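If you prefer to check programmatically that the shell can now find <code>toil-wdl-runner</code>, Python's <code>shutil.which()</code> performs the same lookup <code>PATH</code> does (a quick sanity check, not part of the Toil install itself):

```python
import shutil

# Ask the same question the shell asks: is toil-wdl-runner findable on PATH?
found = shutil.which("toil-wdl-runner")

# Prints the full path (e.g. somewhere under ~/.local/bin) if the PATH change
# took effect, or None if you still need to log out and back in.
print(found)
```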
If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above. ===Configuring Toil for Phoenix=== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. So, use these commands to make sure that Toil knows where it ought to put its caches: echo 'export SINGULARITY_CACHEDIR="${HOME}/.singularity/cache"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${HOME}/.cache/miniwdl"' >>~/.bashrc After that, **log out and log back in again**, to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day], so it is important not to skip this step. ==Running an existing workflow== First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ===Preparing an input file=== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. 
For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code>: echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an *inputs file*, which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ===Testing at small scale on a single machine=== We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --pty bash -i This will start a new shell; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>.
This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ===Running at larger scale=== Back on the head node, let's prepare to run a larger run. Greeting 3 people isn't cool, let's greet one hundred people! Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. mkdir -p logs toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. 
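Both the inputs files we've been writing with <code>echo</code> and the JSON that <code>toil-wdl-runner</code> prints are plain JSON, so Python's <code>json</code> module can generate the former (avoiding shell quoting mistakes) and decode the latter. The <code>\u00f3</code>-style sequences are standard JSON escapes, not corruption; any JSON parser restores the original characters. A small sketch:

```python
import json

# Writing an inputs file, equivalent to the echo command above.
inputs = {"hello_caller.who": "./100_names.txt"}
with open("inputs_big.json", "w") as f:
    json.dump(inputs, f)

# Parsing JSON output restores escaped characters like \u0160.
message = json.loads('"Hello, Gershom \\u0160arlota!"')
print(message)  # Hello, Gershom Šarlota!
```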
==Writing your own workflow== In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ===Writing the file=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is *not* optional, and there is no default value, then the user's inputs file *must* specify a value for it in order for the workflow to run. Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable.
Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only *once* in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional *expressions* with a <code>then</code> and an <code>else</code>, but conditional *statements* only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. 
Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with </code>.</code> access, only if we don't make a noise instead. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? 
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string] } } Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill that in in <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> returns a WDL <code>File</code> containing the standard output printed by the task's command. 
We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optiopnal inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? 
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1 and Toil doesn't actually support not having one yet, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json ==Debugging Workflows== Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong. ===Debugging Options=== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place where you can access them.
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs are reproduced like this. ===Reading the Log=== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen when either the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error. Go up higher in the log until you find lines that look like: [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt: And [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt: These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. ===Reproducing Problems=== When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. 
In the log of your failing Toil task, look for lines like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk. So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ===More Ways of Finding Files=== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try to find the files by name.
For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ===Frequently Asked Questions=== ====I am getting warnings about <code>XDG_RUNTIME_DIR</code>==== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it. 
====Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!==== Toil will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. ==Additional WDL resources== For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL documentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification] b76f0a797a48b9fe149e899fb6e2ffb741feabae 398 397 2023-07-20T13:53:45Z Anovak 4 /* Debugging Workflows */ wikitext text/x-wiki =Tutorial: Getting Started with WDL Workflows on Phoenix= Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows on the Genomics Institute's Phoenix cluster.
By the end, you will be able to run workflows on Phoenix with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong. ==Setup== Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH. ===Getting VPN access=== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ===Connecting to Phoenix=== Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should *not* run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs. To connect to the head node: 1. Connect to the VPN. 2. SSH to <code>phoenix.prism</code>. At the command line, run: ssh phoenix.prism If your username on the cluster (say, <code>flastname</code>) is different than your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@phoenix.prism The first time you connect, you will see a message like: The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established. ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI. This key is not known by any other names Are you sure you want to continue connecting (yes/no/[fingerprint])? 
This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is. ===Installing Toil with WDL support=== Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run: pip install --upgrade --user 'toil[wdl]' If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively: pip install --upgrade --user 'toil[wdl,aws,google]' This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter *will not* look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run: echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc After that, **log out and log back in**, to restart bash and pick up the change. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. 
If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above. ===Configuring Toil for Phoenix=== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. So, use these commands to make sure that Toil knows where it ought to put its caches: echo 'export SINGULARITY_CACHEDIR="${HOME}/.singularity/cache"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${HOME}/.cache/miniwdl"' >>~/.bashrc After that, **log out and log back in again**, to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day], so it is important not to skip this step. ==Running an existing workflow== First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ===Preparing an input file=== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. 
For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code> echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an *inputs file*, which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ===Testing at small scale single-machine=== We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --pty bash -i This will start a new shell; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. 
This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ===Running at larger scale=== Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people! Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. mkdir -p logs toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory.
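Whether printed to standard output or saved to a file with <code>-m</code>, the workflow outputs are plain JSON, so you can inspect them programmatically. Below is a minimal sketch using Python's standard <code>json</code> module; the file names and greetings in it are made-up stand-ins for a real run's output:

```python
import json

# Example output JSON, shaped like what toil-wdl-runner prints to standard
# output (or saves when run with -m). The values are illustrative, not
# captured from a real run.
output_text = (
    '{"hello_caller.message_files": '
    '["slurm_run/Heather.txt", "slurm_run/Grace.txt"], '
    '"hello_caller.messages": ["Hello, Heather!", "Hello, Grace!"]}'
)

outputs = json.loads(output_text)

# Output names are fully qualified: the workflow name, a dot, and the
# output variable name, mirroring how inputs are named in the inputs file.
messages = outputs["hello_caller.messages"]
for message in messages:
    print(message)
```

The dotted key convention is the same one used in the inputs file, so the same small script pattern works for any workflow's outputs.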
==Writing your own workflow== In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ===Writing the file=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is *not* optional, and there is no default value, then the user's inputs file *must* specify a value for it in order for the workflow to run. Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable.
Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only *once* in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional *expressions* with a <code>then</code> and an <code>else</code>, but conditional *statements* only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. 
Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, at least on the iterations where we didn't produce a noise instead. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } Our task should go after the workflow in the file. It looks a lot like a workflow, except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill those in with <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command.
We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1 and Toil doesn't actually support not having one yet, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json ==Debugging Workflows== Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong. ===Debugging Options=== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place where you can access them.
When debug logging is on, the log from every Toil job is inserted into the main Toil log between markers like these:

 =========> Toil job log is here <=========

Normally, only the logs of failing jobs are reproduced like this.

===Reading the Log===

When a WDL workflow fails, you are likely to see a message like this:

 WDL.runtime.error.CommandFailed: task command failed with exit status 1
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism

This means that the command line specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code. That happens either when the command is written wrong, or when the tool you are trying to run detects and reports an error itself. Go up higher in the log until you find lines that look like:

 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt:

and

 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt:

These will be followed by the standard error and standard output data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.

===Reproducing Problems===

When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that reproduces the problem. In addition to the standard output and standard error logs described above, you may also need your tool's input files in order to do this.
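The debug log can be very large. One way (a sketch, not official Toil tooling) to pull out just the reported log paths and downloaded input files is to grep for the line patterns described above:

```shell
# Make a tiny stand-in for the real Toil log (the real one is whatever
# toil-wdl-runner printed; these paths are made up for the example):
cat > toil_debug.log <<'EOF'
[2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d5/9b0e/467f/stderr.txt:
[2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f88/toplog.sh' to path '/data/tmp/c3d5/9b0e/467f/toplog.sh'
EOF

# Pull out the task log locations and the input files the failing job used:
grep -E 'Standard (error|output) at|Downloaded file' toil_debug.log
```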
In the log of your failing Toil task, look for lines like this:

 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'
 ...

The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk. So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at:

 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam

===More Ways of Finding Files===

Sometimes a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try to find the files by name.
For example, if you want to look at <code>Sample.bam</code>, you can look for it like this:

 find /path/to/the/jobstore -name "Sample.bam"

If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:

 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam

You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:

 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam

Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path, relative to the job store, where this file can be found.

===Using Development Versions of Toil===

Sometimes, bugs will be fixed in the development version of Toil but not released yet.
To try the current development version of Toil, you can install it like this:

 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'

If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do:

 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'

===Frequently Asked Questions===

====I am getting warnings about <code>XDG_RUNTIME_DIR</code>====

You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm provides Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't do. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it.

====Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!====

Toil will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>; when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log of a worker process that did not finish (i.e. that crashed), make sure to look on the machine the worker actually ran on, not on the head node.
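Returning to the <code>toilfile:</code> URIs from the file-finding sections above: instead of a web URL-decoder, you can decode the URI and build the on-disk path locally. A sketch, using the example values from those sections:

```shell
# Whatever you passed to --jobStore:
JOBSTORE=/private/groups/patenlab/anovak/jobstore
# The URI as it appeared in the job's log:
URI='toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam'

# URL-decode the URI with the Python standard library:
DECODED=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.argv[1]))' "$URI")

# The part after the last colon is the path relative to the job store:
echo "$JOBSTORE/${DECODED##*:}"
```

The file may no longer exist at that path if the job store has since been cleaned up.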
==Additional WDL resources==

For more information on writing and running WDL workflows, see:

* [https://docs.openwdl.org/en/stable/ The WDL documentation]
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube]
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]

= How to access the public servers =

== How to Gain Access to the Public Genomics Institute Compute Servers ==

If you need access to the Genomics Institute compute servers, please complete this request form: https://app.smartsheet.com/b/form/a76dbd90ba0240ab9ea9d39b390586ce. There are two parts to this process.

1. The user fills in ALL required fields and submits the form.

2. The Sponsor/PI receives an email from Smartsheet, fills in all required fields, and submits.

We will receive your completed request, create your account, and go over the details with you in a short Zoom meeting.

== Account and Storage Cost ==

Costs for having an active UNIX account and for storage (per TB) are listed in this document, under "Genomics Project Support", specifically "Genomics IT Systems User Support per user" and "Genomics Data Storage per TB": https://planning.ucsc.edu/budget/rates-and-assessments/recharge-rates/docs/2021-22-approved-recharge-rates.pdf

== Account Expiration ==

Your UNIX account will have an expiration date associated with it after creation, as requested by your sponsor. Please take note of this expiration date when your account is created. You will receive notice by email when your account is about to expire. To renew, simply ask the PI that sponsored you (who will be named in the notice) to email '''cluster-admin@soe.ucsc.edu''' requesting that your account be renewed for another year, or any other amount of time.
If your account expires, it will be suspended and you will no longer be able to log in or view any data you have on our systems. Any automated scripts (owned by you) that run via cron or other mechanisms will cease to function.

== Server Types and Management ==

You can connect to our public compute servers via SSH:

'''courtyard.gi.ucsc.edu''': 1TB RAM, 64 cores, 672GB local scratch space, Ubuntu 22.04.2

'''park.gi.ucsc.edu''': 256GB RAM, 32 cores, 5TB local scratch space, Ubuntu 22.04.2

These servers are managed by the Genomics Institute Cluster Admin group. If you need software installed on them, please make your request by emailing cluster-admin@soe.ucsc.edu.

== Storage ==

These servers mount two types of storage: home directories and group storage directories. Your home directory is located at /public/home/username and has a 30GB quota. Group storage directories are created per PI, and each has a 15TB quota. For example, if David Haussler is the PI you report to directly, the directory would be /public/groups/hausslerlab. Request access to that group directory and you will then be able to write to it. Each group directory is shared by the lab it belongs to, so be mindful of everyone's data usage and share the 15TB available per group accordingly.

On the compute servers you can check your group's current quota usage with the '/usr/bin/viewquota' command. You can only check the quota of a group you are part of (you would be a member of the UNIX group of the same name).
If you wanted to check the quota usage of /public/groups/hausslerlab, for example, you would do:

 $ viewquota hausslerlab
 Project quota on /export (/dev/mapper/export)
 Project ID     Used   Soft   Hard   Warn/Grace
 -----------    ----   ----   ----   ----------
 hausslerlab    1.8T   15T    16T    00 [------]

== Actually Doing Work and Computing ==

When doing research, running jobs and the like, please be careful of your resource consumption on the server you are on. Don't run so many threads or processes at once that you overrun the available RAM or disk I/O. If you are not sure of your potential RAM, CPU, or disk impact, start small with one or two processes and work your way up from there.

Also, before starting your jobs, check what else is already happening on the server with the 'top' command, to see who and what else is running and what resources are already being consumed. If, after starting a process, you realize that the server slows down considerably or becomes unusable, kill your processes and re-evaluate what you need to make things work. These servers are shared resources - be a good neighbor!

== Serving Files to the Public via the Web ==

If you want to set up a web page on courtyard, or serve files over HTTP from there, do this:

 mkdir /public/home/''your_username''/public_html
 chmod 755 /public/home/''your_username''/public_html

Put data in the public_html directory. The URL will be:

 http://public.gi.ucsc.edu/''~username''/

== /data/scratch Space on the Servers ==

Each server will generally have a local /data/scratch filesystem that you can use to store temporary files. '''BE ADVISED''' that /data/scratch is not backed up, and the data there could disappear in the event of a disk failure or anything else. Do not store important data there. If it is important, move it somewhere else very soon after creation.
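A sketch of moving results off scratch as soon as they are produced. The demo directories here stand in for your real scratch area and group directory (something like /data/scratch/$USER and /public/groups/yourlab):

```shell
# Demo stand-ins for the real scratch and group storage paths:
mkdir -p demo_scratch/results demo_group_storage
printf 'important output\n' > demo_scratch/results/output.txt

# Copy the results off scratch; rsync-style tools also work well here:
cp -r demo_scratch/results demo_group_storage/
```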
= Firewalled Computing Resources Overview =

== Doing Work and Computing ==

When doing research, running jobs and the like, please be careful of your resource consumption on the server you are on. Don't run so many threads or processes at once that you overrun the available RAM or disk I/O. If you are not sure of your potential RAM, CPU, or disk impact, start small with one or two processes and work your way up from there.

Also, before starting your jobs, check what else is already happening on the server with the 'top' command, to see who and what else is running and what resources are already being consumed. If, after starting a process, you realize that the server slows down considerably or becomes unusable, kill your processes and re-evaluate what you need to make things work. These servers are shared resources - be a good neighbor!
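Besides <code>top</code>, a few low-level checks (standard Linux tools, shown as a sketch) give a quick snapshot of what a server has before you size your jobs:

```shell
# How many CPU cores does this machine have?
nproc
# Memory totals (first lines of /proc/meminfo):
head -n 3 /proc/meminfo
# Current load average, to compare against the core count:
cat /proc/loadavg
```

If the load average is already near the core count, the machine is busy and you should scale back or wait.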
== Server Types and Management ==

After confirming your VPN software is working, you can ssh into one of the shared compute servers behind the VPN. The DNS suffix for all machines is ".prism", so "mustard" has a full DNS name of "mustard.prism":

{| class="wikitable" style="text-align:center;"
|- style="font-weight:bold; text-align:left;"
! Node Name
! Operating System
! CPU Cores
! Memory
! Network Bandwidth
! Scratch Space
|-
| style="text-align:left;" | mustard
| style="text-align:left;" | Ubuntu 22.04
| 160
| 1.5 TB
| 10 Gb/s
| 9 TB
|-
| style="text-align:left;" | emerald
| style="text-align:left;" | Ubuntu 22.04
| 64
| 1 TB
| 10 Gb/s
| 690 GB
|-
| style="text-align:left;" | crimson
| style="text-align:left;" | Ubuntu 22.04
| 32
| 256 GB
| 10 Gb/s
| 5.5 TB
|-
| style="text-align:left;" | razzmatazz
| style="text-align:left;" | Ubuntu 22.04
| 32
| 256 GB
| 10 Gb/s
| 5.5 TB
|}

These ''shared'' servers are managed by the Genomics Institute Cluster Admin group. If you need software installed on any of these servers, please make your request by emailing cluster-admin@soe.ucsc.edu.

== The Firewall ==

All servers in this environment are behind a firewall, so you must connect to the VPN in order to access them. They are not accessible from the greater Internet without the VPN. You will still be able to connect outbound from them to other servers on the Internet to copy data in, sync git repos, and the like; only inbound connections are blocked. All machines behind the firewall have the private domain name suffix "*.prism".

== The Phoenix Cluster ==

This is a cluster of ~22 Ubuntu 22.04 nodes, some of which have GPUs in them. Each node generally has about 2TB RAM and 128 cores, although the cluster is heterogeneous and has multiple node types. You interact with the Phoenix Cluster via the Slurm job scheduler, and you must specifically request access to use Slurm on the cluster; just email '''cluster-admin@soe.ucsc.edu''' for access.

{| class="wikitable" style="text-align:center;"
|- style="font-weight:bold;"
! Node Name
! Operating System
! CPU Cores
! GPUs/Type
! Memory
! Network Bandwidth
! Scratch Space
|-
| style="text-align:left;" | phoenix-00
| style="text-align:left;" | Ubuntu 22.04
| 256
| 8 / Nvidia A100
| 1 TB
| 10 Gb/s
| 16 TB NVMe
|-
| style="text-align:left;" | phoenix-[01-05]
| style="text-align:left;" | Ubuntu 22.04
| 128
| 8 / Nvidia A5500
| 2 TB
| 10 Gb/s
| 16 TB NVMe
|-
| style="text-align:left;" | phoenix-[06-08]
| style="text-align:left;" | Ubuntu 22.04
| 128
| N/A
| 2 TB
| 10 Gb/s
| 16 TB NVMe
|-
| style="text-align:left;" | phoenix-09
| style="text-align:left;" | Ubuntu 22.04
| 40
| 8 / Nvidia 1080ti
| 256 GB
| 10 Gb/s
| 1.5 TB NVMe
|-
| style="text-align:left;" | phoenix-10
| style="text-align:left;" | Ubuntu 22.04
| 36
| 4 / Nvidia 2080ti
| 385 GB
| 10 Gb/s
| 3 TB NVMe
|-
| style="text-align:left;" | phoenix-[11-21]
| style="text-align:left;" | Ubuntu 22.04
| 128
| N/A
| 2 TB
| 10 Gb/s
| 16 TB NVMe
|}

The cluster head node, from which all jobs are submitted via the Slurm job scheduling framework, is '''phoenix.prism'''. To learn more about how to use Slurm, refer to: https://giwiki.gi.ucsc.edu/index.php/Genomics_Institute_Computing_Information#Slurm_at_the_Genomics_Institute

For scratch space on the cluster, TMPDIR will be set to /data/tmp. That area is cleaned often, so don't store any data there that isn't being used by your jobs.
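To run something on the cluster, you write a batch script and submit it from the head node with <code>sbatch</code>. A minimal sketch (the job name and resource numbers here are examples, not site requirements):

```shell
# Write a minimal Slurm batch script:
cat > hello_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --time=00:05:00
# TMPDIR points at node-local /data/tmp; use it for intermediate files.
echo "Running on $(hostname), scratch in $TMPDIR"
EOF

# Then, on the head node (phoenix.prism), submit it with:
# sbatch hello_job.sh
```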
Also, before running your stuff, check what else is already happening on the server by using the 'top' command to see who else and what else is running and what kind of resources are already being consumed. If, after starting a process, you realize that the server slows down considerably or becomes unusable, kill your processes and re-evaluate what you need to make things work. These servers are shared resources - be a good neighbor! == Server Types and Management == After confirming your VPN software is working, you can ssh into one of the shared compute servers behind the VPN. The DNS suffix for all machines is ".prism". So, "mustard" would have a full DNS name of "mustard.prism": {| class="wikitable" style="text-align:center;" |- style="font-weight:bold; text-align:left;" ! Node Name ! Operating System<br /> ! CPU Cores ! Memory ! Network Bandwidth ! Scratch Space |- | style="text-align:left;" | mustard | style="text-align:left;" | Ubuntu 22.04 | 160 | 1.5 TB | 10 Gb/s | 9 TB |- | style="text-align:left;" | emerald | style="text-align:left;" | Ubuntu 22.04 | 64 | 1 TB | 10 Gb/s | 690 GB |- | style="text-align:left;" | crimson | style="text-align:left;" | Ubuntu 22.04 | 32 | 256 GB | 10 Gb/s | 5.5 TB |- | style="text-align:left;" | razzmatazz | style="text-align:left;" | Ubuntu 22.04 | 32 | 256 GB | 10 Gb/s | 5.5 TB |} These ''shared'' servers are managed by the Genomics Institute Cluster Admin group. If you need software installed on any of these servers, please make your request by emailing cluster-admin@soe.ucsc.edu. == The Firewall == All servers are behind a firewall in this environment, and as such, you must connect to the VPN in order to access them. They will not be accessible from the greater Internet without VPN. Although you will be able to connect outbound from them to other servers on the internet to copy data in, sync git repos, stuff like that. It is only inbound connections that will be blocked. 
All machines behind the firewall have the private domain name suffix of "*.prism". == The Phoenix Cluster == This is a cluster of ~22 Ubuntu 22.04 nodes, some of which have GPUs in them. Each node generally has about 2TB RAM and 128 cores, although the cluster is heterogeneous and has multiple node types. You interact with the Phoenix Cluster via the Slurm Job Scheduler. You must specifically request access to use Slurm on the Phoenix Cluster, just email '''cluster-admin@soe.ucsc.edu''' for access. {| class="wikitable" style="text-align:center;" |- style="font-weight:bold;" ! Node Name ! Operating System<br /> ! CPU Cores ! GPUs/Type ! Memory ! Network Bandwidth ! Scratch Space |- | style="text-align:left;" | phoenix-00 | style="text-align:left;" | Ubuntu 22.04 | 256 | 8 / Nvidia A100 | 1 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[01-05] | style="text-align:left;" | Ubuntu 22.04 | 128 | 8 / Nvidia A5500 | 2 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[06-08] | style="text-align:left;" | Ubuntu 22.04 | 128 | N/A | 2 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-09 | style="text-align:left;" | Ubuntu 22.04 | 40 | 8 / Nvidia 1080ti | 256 GB | 10 Gb/s | 1.5 TB NVMe |- | style="text-align:left;" | phoenix-10 | style="text-align:left;" | Ubuntu 22.04 | 36 | 4 / Nvidia 2080ti | 385 GB | 10 Gb/s | 3 TB NVMe |- | style="text-align:left;" | phoenix-[11-21] | style="text-align:left;" | Ubuntu 22.04 | 128 | N/A | 2 TB | 10 Gb/s | 16 TB NVMe |} The cluster head node, from which all jobs are submitted via the SLURM job scheduling framework, is '''phoenix.prism'''. To learn more about how to use Slurm, refer to: https://giwiki.gi.ucsc.edu/index.php/Genomics_Institute_Computing_Information#Slurm_at_the_Genomics_Institute For scratch on the cluster, TMPDIR will be set to /data/tmp. That area is cleaned often so don't store any data there that isn't being used by your jobs. 
Phoenix WDL Tutorial 0 45 412 398 2023-08-10T15:06:42Z Anovak 4 /* Configuring Toil for Phoenix */ wikitext text/x-wiki =Tutorial: Getting Started with WDL Workflows on Phoenix= Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows on the Genomics Institute's Phoenix cluster. By the end, you will be able to run workflows on Phoenix with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong. ==Setup== Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH. ===Getting VPN access=== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ===Connecting to Phoenix=== Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should ''not'' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs. To connect to the head node: 1. Connect to the VPN. 2. SSH to <code>phoenix.prism</code>. At the command line, run: ssh phoenix.prism If your username on the cluster (say, <code>flastname</code>) is different from your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@phoenix.prism The first time you connect, you will see a message like: The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established. ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI. This key is not known by any other names Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is. ===Installing Toil with WDL support=== Once you are on the head node, you can install Toil, a program for running workflows.
When installing, you need to specify that you want WDL support. To do this, you can run: pip install --upgrade --user 'toil[wdl]' If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively: pip install --upgrade --user 'toil[wdl,aws,google]' This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter ''will not'' look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run: echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc After that, ''log out and log back in'', to restart bash and pick up the change. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above. ===Configuring Toil for Phoenix=== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps.
So, use these commands to make sure that Toil knows where it ought to put its caches: echo 'export SINGULARITY_CACHEDIR="${HOME}/.singularity/cache"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${HOME}/.cache/miniwdl"' >>~/.bashrc After that, ''log out and log back in again'', to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day], so it is important not to skip this step. ==Running an existing workflow== First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ===Preparing an input file=== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code>: echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow.
(You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ===Testing at small scale on a single machine=== We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --pty bash -i This will start a new shell; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person.
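A typo in an inputs file only surfaces once the workflow starts, so it can save a round trip to check that the JSON at least parses first. A minimal sketch, assuming <code>python3</code> is available (as it is wherever you just installed Toil with <code>pip</code>):

```shell
# Recreate the inputs file from above, then check that it parses.
# json.tool exits non-zero and prints an error message on invalid JSON.
echo '{"hello_caller.who": "./names.txt"}' > inputs.json
python3 -m json.tool inputs.json
```

Note that this only checks the JSON syntax; it cannot tell you whether the key matches an input the workflow actually declares.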
To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ===Running at larger scale=== Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people! Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. mkdir -p logs toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. ==Writing your own workflow== In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ===Writing the file=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0.
So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run. Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0.
Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an <code>else</code> you will have to check the negated condition. So first, let's handle the special cases. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later.
To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access (at least when the call actually ran, rather than one of the Fizz/Buzz branches). Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>.
So let's fill those in with <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.
task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Now the only thing missing is a workflow-level <code>output</code> section.
Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1 and Toil doesn't actually support not having one yet, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. Now all that remains is to run the workflow!
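Before running it, it helps to know what output to expect. Here is the same Fizz Buzz logic sketched as a plain Bash script (a hypothetical reference implementation for checking results, not part of the workflow), using the default <code>to_fizz = 3</code> and <code>to_buzz = 5</code> and an <code>item_count</code> of 20:

```shell
# Reference Fizz Buzz output mirroring the WDL conditionals above:
# multiples of both 3 and 5 give "FizzBuzz", then "Fizz", then "Buzz",
# and every other number is just printed as itself.
item_count=20
i=1
while [ "$i" -le "$item_count" ]; do
  if [ $((i % 3)) -eq 0 ] && [ $((i % 5)) -eq 0 ]; then
    echo "FizzBuzz"
  elif [ $((i % 3)) -eq 0 ]; then
    echo "Fizz"
  elif [ $((i % 5)) -eq 0 ]; then
    echo "Buzz"
  else
    echo "$i"
  fi
  i=$((i + 1))
done > expected.txt
cat expected.txt
```

Comparing <code>expected.txt</code> against the workflow's output JSON is a quick sanity check of the scatter-and-<code>select_first()</code> logic.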
As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json ==Debugging Workflows== Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong. ===Debugging Options=== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs are reproduced like this. ===Reading the Log=== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen when either the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error. 
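The "exit status" in that message is the ordinary shell convention: 0 means success and any non-zero value means failure. A tiny illustration of the mechanism Toil is reporting on:

```shell
# Every command exits with a status, available in $? right afterwards.
# Toil surfaces this number when a task's command fails.
sh -c 'exit 0'
echo "a successful command exits with status $?"
sh -c 'exit 1' || failed=$?
echo "a failing command exits with status ${failed}"
```

Tools that use <code>set -e</code> in their scripts (as the tutorial task does) will stop at the first failing command and pass its status back up to Toil.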
Go up higher in the log until you find lines that look like: [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt: And [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt: These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. ===Reproducing Problems=== When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. In the log of your failing Toil task, look for lines like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk.
So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ===More Ways of Finding Files=== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try to find the files by name. For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ===Using Development Versions of Toil=== Sometimes, bugs will be fixed in the development version of Toil, but not released yet.
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ===Frequently Asked Questions=== ====I am getting warnings about <code>XDG_RUNTIME_DIR</code>==== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it. ====Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!==== Toil will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. 
=Tutorial: Getting Started with WDL Workflows on Phoenix=

Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.

This tutorial will help you get started writing and running workflows on the Genomics Institute's Phoenix cluster. By the end, you will be able to run workflows on Phoenix with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.

==Setup==

Before we begin, you will need a computer to work from, on which you are able to install software, and the ability to connect to other machines over SSH.

===Getting VPN access===

We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.

===Connecting to Phoenix===

Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should '''not''' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs. To connect to the head node:

# Connect to the VPN.
# SSH to <code>phoenix.prism</code>. At the command line, run:

 ssh phoenix.prism

If your username on the cluster (say, <code>flastname</code>) is different from your username on your computer (which might be <code>firstname</code>), you might instead have to run:

 ssh flastname@phoenix.prism

The first time you connect, you will see a message like:

 The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established.
 ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI.
 This key is not known by any other names
 Are you sure you want to continue connecting (yes/no/[fingerprint])?

This is your computer asking you to help it decide whether it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is.

===Installing Toil with WDL support===

Once you are on the head node, you can install Toil, a program for running workflows.
When installing, you need to specify that you want WDL support. To do this, you can run:

 pip install --upgrade --user 'toil[wdl]'

If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively:

 pip install --upgrade --user 'toil[wdl,aws,google]'

This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter ''will not'' look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:

 echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc

After that, '''log out and log back in''' to restart bash and pick up the change. To make sure it worked, you can run:

 toil-wdl-runner --help

If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above.

===Configuring Toil for Phoenix===

Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, so we should configure Toil to use it for caching the Docker container images used for running workflow steps.
So, use these commands to make sure that Toil knows where it ought to put its caches:

 echo 'export SINGULARITY_CACHEDIR="${HOME}/.singularity/cache"' >>~/.bashrc
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${HOME}/.cache/miniwdl"' >>~/.bashrc

After that, '''log out and log back in again''' to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day], so it is important not to skip this step.

==Running an existing workflow==

First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.

 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl

===Preparing an input file===

Near the top of the WDL file, there's a section like this:

 workflow hello_caller {
     input {
         File who
     }

This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names.
Let's make it in <code>names.txt</code>:

 echo "Mridula Resurrección" >names.txt
 echo "Gershom Šarlota" >>names.txt
 echo "Ritchie Ravi" >>names.txt

Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this:

 echo '{"hello_caller.who": "./names.txt"}' >inputs.json

Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].

===Testing at small scale on a single machine===

We are now ready to run the workflow! You don't want to run workflows on the head node, so use Slurm to get an interactive session on one of the cluster's worker nodes by running:

 srun -c 2 --mem 8G --pty bash -i

This will start a new shell; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command:

 toil-wdl-runner self_test.wdl inputs.json -o local_run

This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>.
This will print a lot of logging to standard error, and to standard output it will print:

 {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]}

The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>.

===Running at larger scale===

Back on the head node, let's prepare for a larger run. Greeting 3 people isn't cool; let's greet one hundred people! Go get this handy list of people and cut it to length:

 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt
 head -n100 1000_names.txt >100_names.txt

And make a new inputs file:

 echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json

Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create, where it can store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it.

 mkdir -p logs
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json

This will tick along for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory.
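Since <code>-m</code> saves the workflow's output JSON to a file, you can also inspect the results programmatically instead of reading them by eye. Here is a minimal sketch using Python's standard <code>json</code> module, run against a literal copy of the small local run's output shown above (the same parsing applies to a saved <code>slurm_run.json</code>):

```python
import json

# Output JSON as printed by toil-wdl-runner for the small local run above.
# With -m, the same JSON is saved to a file (e.g. slurm_run.json) instead.
output_json = (
    '{"hello_caller.message_files": ["local_run/Mridula Resurrecci\\u00f3n.txt", '
    '"local_run/Gershom \\u0160arlota.txt", "local_run/Ritchie Ravi.txt"], '
    '"hello_caller.messages": ["Hello, Mridula Resurrecci\\u00f3n!", '
    '"Hello, Gershom \\u0160arlota!", "Hello, Ritchie Ravi!"]}'
)

outputs = json.loads(output_json)

# Output keys are "<workflow name>.<output name>", mirroring the inputs file.
messages = outputs["hello_caller.messages"]
for message in messages:
    print(message)
```

For a real run, you would replace the literal with <code>json.load(open("slurm_run.json"))</code> and, for example, check that you got the expected 100 messages.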
==Writing your own workflow==

In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz].

===Writing the file===

All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement:

 version 1.0

Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>.

 version 1.0
 workflow FizzBuzz {
 }

Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section.

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
 }

Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.

Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function and assigning the result to an <code>Array[Int]</code> variable.
 Array[Int] numbers = range(item_count)

WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.

Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each one. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0.

 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
 }

Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we make an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array.

Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end, and take advantage of the fact that variables from un-executed conditionals are <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else, you will have to check the negated condition. So first, let's handle the special cases.
 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
     if (one_based % to_fizz == 0) {
         String fizz = "Fizz"
         if (one_based % to_buzz == 0) {
             String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
         }
     }
     if (one_based % to_buzz == 0) {
         String buzz = "Buzz"
     }
     if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
         # Just a normal number.
     }
 }

Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access. Since the call only runs for numbers where we don't make a noise instead, its output will be <code>null</code> for the other numbers.

 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
     if (one_based % to_fizz == 0) {
         String fizz = "Fizz"
         if (one_based % to_buzz == 0) {
             String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
         }
     }
     if (one_based % to_buzz == 0) {
         String buzz = "Buzz"
     }
     if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
         # Just a normal number.
         call stringify_number {
             input:
                 the_number = one_based
         }
     }
     String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
 }

We can put the code into the workflow now, and set about writing the task.

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }

Our task should go after the workflow in the file. It looks a lot like a workflow, except it uses <code>task</code>.

 task stringify_number {
 }

We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill those in in the <code>input</code> and <code>output</code> sections.

 task stringify_number {
     input {
         Int the_number
     }
     # ???
     output {
         String the_string # = ???
     }
 }

Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string # = ???
     }
 }

Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command.
We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
 }

We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file:

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }
 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1, and Toil doesn't actually support not having one yet, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task.

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
     output {
         Array[String] fizzbuzz_results = result
     }
 }
 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as an array. Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:

 echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json

Then run it on the cluster with Toil:

 mkdir -p logs
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

Or locally:

 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

==Debugging Workflows==

Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.

===Debugging Options===

When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code>, so that the stored files shipped between jobs are in a place where you can access them.
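Another low-tech debugging option for a workflow with simple logic, like the FizzBuzz example above, is to reimplement that logic in a few lines of plain Python and compare against the workflow's saved output JSON. This sketch is illustrative only (it is not part of Toil); the <code>select_first()</code> behavior is modeled with an explicit <code>None</code> check:

```python
def fizzbuzz(item_count, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    """Mirror of the FizzBuzz WDL workflow's scatter logic."""
    results = []
    for i in range(item_count):
        one_based = i + 1  # WDL range() starts at 0, but FizzBuzz starts at 1
        if one_based % to_fizz == 0 and one_based % to_buzz == 0:
            # Like select_first([fizzbuzz_override, "FizzBuzz"])
            results.append("FizzBuzz" if fizzbuzz_override is None else fizzbuzz_override)
        elif one_based % to_fizz == 0:
            results.append("Fizz")
        elif one_based % to_buzz == 0:
            results.append("Buzz")
        else:
            # Stands in for the stringify_number task
            results.append(str(one_based))
    return results

print(fizzbuzz(15))
```

Comparing this function's output against <code>FizzBuzz.fizzbuzz_results</code> in <code>fizzbuzz_out.json</code> can quickly tell you whether a problem is in the workflow's logic or in how it is being run.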
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:

 =========> Toil job log is here <=========

Normally, only the logs of failing jobs are reproduced like this.

===Reading the Log===

When a WDL workflow fails, you are likely to see a message like this:

 WDL.runtime.error.CommandFailed: task command failed with exit status 1
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism

This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen when either the command line command is written wrong, or the error detection code in the tool you are trying to run detects and reports an error. Go up higher in the log until you find lines that look like:

 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt:

and

 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt:

These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.

===Reproducing Problems===

When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this.
In the log of your failing Toil task, look for lines like these:

 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'
 ...

The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk. So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at:

 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam

===More Ways of Finding Files===

Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try to find the files by name.
For example, if you want to look at <code>Sample.bam</code>, you can look for it like this:

 find /path/to/the/jobstore -name "Sample.bam"

If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:

 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam

You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/ an online URL decoder], getting this:

 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam

Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path, relative to the job store, where this file can be found.

===Using Development Versions of Toil===

Sometimes, bugs will be fixed in the development version of Toil, but not released yet.
To try the current development version of Toil, you can install it like this:

 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'

If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do:

 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'

===Frequently Asked Questions===

====I am getting warnings about <code>XDG_RUNTIME_DIR</code>====

You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't do. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it.

====Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!====

Toil will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.
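====Can I decode a <code>toilfile:</code> URI without a web tool?====

Yes. As an alternative to the online URL decoder mentioned in "More Ways of Finding Files" above, a few lines of Python with the standard library's <code>urllib.parse.unquote()</code> do the same decoding; this sketch uses the example URI from that section, and takes the part after the last colon as the job-store-relative path:

```python
from urllib.parse import unquote

# The percent-encoded URI from the "Virtualized ... as WDL file" log line above.
uri = ("toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2F"
       "instance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2F"
       "Sample.chr14.bam/Sample.chr14.bam")

# Undo the percent-encoding (%3A -> ":", %2F -> "/")
decoded = unquote(uri)

# The path relative to the job store is everything after the last colon
relative_path = decoded.rsplit(":", 1)[1]
print(relative_path)
```

You can then join that path onto your <code>--jobStore</code> directory to find the file on disk.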
==Additional WDL resources==

For more information on writing and running WDL workflows, see:

* [https://docs.openwdl.org/en/stable/ The WDL documentation]
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube]
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ===Connecting to Phoenix=== Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should ''not'' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs. To connect to the head node: 1. Connect to the VPN. 2. SSH to <code>phoenix.prism</code>. At the command line, run: ssh phoenix.prism If your username on the cluster (say, <code>flastname</code>) is different from your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@phoenix.prism The first time you connect, you will see a message like: The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established. ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI. This key is not known by any other names Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is. ===Installing Toil with WDL support=== Once you are on the head node, you can install Toil, a program for running workflows.
When installing, you need to specify that you want WDL support. To do this, you can run: pip install --upgrade --user 'toil[wdl]' If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively: pip install --upgrade --user 'toil[wdl,aws,google]' This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter ''will not'' look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run: echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc After that, '''log out and log back in''', to restart bash and pick up the change. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above. ===Configuring Toil for Phoenix=== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps.
So, use these commands to make sure that Toil knows where it ought to put its caches: echo 'export SINGULARITY_CACHEDIR="${HOME}/.singularity/cache"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${HOME}/.cache/miniwdl"' >>~/.bashrc After that, '''log out and log back in again''', to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day], so it is important not to skip this step. ==Running an existing workflow== First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ===Preparing an input file=== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. 
Let's make it in <code>names.txt</code>: echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ===Testing at small scale on a single machine=== We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --pty bash -i This will start a new shell; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>.
This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ===Running at larger scale=== Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people! Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. mkdir -p logs toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory.
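For workflows with many inputs, it can be less error-prone to generate the inputs JSON with a small script instead of <code>echo</code>. A minimal Python sketch of the "workflow name, dot, input name" convention described above (the filenames are just this tutorial's examples):

```python
import json

# Map each workflow input to a value, keyed as "<workflow name>.<input name>",
# following the WDL JSON input format.
inputs = {
    "hello_caller.who": "./100_names.txt",
}

# Write the inputs file that toil-wdl-runner will consume.
with open("inputs_big.json", "w") as f:
    json.dump(inputs, f, indent=2)
```

This produces the same <code>inputs_big.json</code> as the <code>echo</code> command above, and scales better once a workflow has more than a couple of inputs.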
==Writing your own workflow== In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ===Writing the file=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run. Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable.
Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases.
Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, as long as we didn't produce a noise instead. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? 
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill that in in <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> returns a WDL <code>File</code> containing the standard output printed by the task's command.
We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. And we're going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? 
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1 and Toil doesn't actually support not having one yet, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? 
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json ==Debugging Workflows== Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong. ===Debugging Options=== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place you can access them.
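When debugging a workflow, it also helps to know what output you expect. The FizzBuzz workflow's <code>select_first()</code> logic can be modeled outside WDL; here is a minimal Python sketch of the same rules (the defaults mirror the workflow's <code>to_fizz = 3</code> and <code>to_buzz = 5</code>; <code>None</code> stands in for WDL's <code>null</code> from un-executed conditionals):

```python
def fizzbuzz_results(item_count, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    """Model of the FizzBuzz workflow: one result string per 1-based number."""
    results = []
    for one_based in range(1, item_count + 1):
        # Each variable is None unless its conditional "executes",
        # like WDL variables from un-executed conditionals being null.
        fizz = "Fizz" if one_based % to_fizz == 0 else None
        buzz = "Buzz" if one_based % to_buzz == 0 else None
        fizzbuzz = None
        if fizz and buzz:
            fizzbuzz = fizzbuzz_override or "FizzBuzz"
        # Mirror select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]):
        # take the first non-null value, falling back to the number as a string.
        results.append(fizzbuzz or fizz or buzz or str(one_based))
    return results

print(fizzbuzz_results(15))
```

Comparing this model's output against the workflow's <code>fizzbuzz_results</code> array is a quick way to tell whether a discrepancy is in your workflow logic or in how it is being run.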
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs are reproduced like this. ===Reading the Log=== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen when either the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error. Go up higher in the log until you find lines that look like: [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt: And [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt: These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. ===Reproducing Problems=== When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. 
In the log of your failing Toil task, look for lines like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk. So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ===More Ways of Finding Files=== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name.
For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ===Using Development Versions of Toil=== Sometimes, bugs will be fixed in the development version of Toil, but not released yet. 
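As an aside on the <code>toilfile:</code> URIs above: rather than pasting them into a web decoder, you can decode them locally. A minimal Python sketch using the standard library's <code>urllib.parse.unquote</code>, applied to the example URI from the log line above:

```python
from urllib.parse import unquote

# The toilfile URI as it appears in the Toil debug log above.
uri = ("toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob"
       "%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351"
       "%2FSample.chr14.bam/Sample.chr14.bam")

# URL-decode the percent escapes, then take the part after the last
# colon: that is the path relative to the job store.
decoded = unquote(uri)
relative_path = decoded.rsplit(":", 1)[-1]
print(relative_path)
```

The printed path, joined onto your <code>--jobStore</code> directory, is where the uploaded file lives.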
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ===Frequently Asked Questions=== ====I am getting warnings about <code>XDG_RUNTIME_DIR</code>==== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it. ====Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!==== Toil will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. 
==Additional WDL resources== For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL documentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification] 482197265102648915b14fb00ac010b3e9049e43 415 414 2023-08-10T15:11:04Z Anovak 4 /* Writing the file */ wikitext text/x-wiki =Tutorial: Getting Started with WDL Workflows on Phoenix= Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows on the Genomics Institute's Phoenix cluster. By the end, you will be able to run workflows on Phoenix with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong. ==Setup== Before we begin, you will need a computer to work at, on which you are able to install software, and the ability to connect to other machines over SSH. ===Getting VPN access=== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ===Connecting to Phoenix=== Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should ''not'' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs. To connect to the head node: 1. Connect to the VPN. 2. SSH to <code>phoenix.prism</code>. At the command line, run: ssh phoenix.prism If your username on the cluster (say, <code>flastname</code>) is different from your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@phoenix.prism The first time you connect, you will see a message like: The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established. ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI. This key is not known by any other names Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is. ===Installing Toil with WDL support=== Once you are on the head node, you can install Toil, a program for running workflows.
When installing, you need to specify that you want WDL support. To do this, you can run:

 pip install --upgrade --user 'toil[wdl]'

If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively:

 pip install --upgrade --user 'toil[wdl,aws,google]'

This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter '''will not''' look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:

 echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc

After that, '''log out and log back in''', to restart bash and pick up the change. To make sure it worked, you can run:

 toil-wdl-runner --help

If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above.

===Configuring Toil for Phoenix===

Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps.
So, use these commands to make sure that Toil knows where it ought to put its caches:

 echo 'export SINGULARITY_CACHEDIR="${HOME}/.singularity/cache"' >>~/.bashrc
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${HOME}/.cache/miniwdl"' >>~/.bashrc

After that, '''log out and log back in again''', to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day], so it is important not to skip this step.

==Running an existing workflow==

First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.

 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl

===Preparing an input file===

Near the top of the WDL file, there's a section like this:

 workflow hello_caller {
     input {
         File who
     }

This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names.
Let's make it in <code>names.txt</code>:

 echo "Mridula Resurrección" >names.txt
 echo "Gershom Šarlota" >>names.txt
 echo "Ritchie Ravi" >>names.txt

Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this:

 echo '{"hello_caller.who": "./names.txt"}' >inputs.json

Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].

===Testing at small scale on a single machine===

We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:

 srun -c 2 --mem 8G --pty bash -i

This will start a new shell; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command:

 toil-wdl-runner self_test.wdl inputs.json -o local_run

This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>.
This will print a lot of logging to standard error, and to standard output it will print:

 {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]}

The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>.

===Running at larger scale===

Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people! Go get this handy list of people and cut it to length:

 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt
 head -n100 1000_names.txt >100_names.txt

And make a new inputs file:

 echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json

Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it.

 mkdir -p logs
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json

This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory.
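As a side note, once a workflow has more than a couple of inputs, writing the inputs JSON by hand with <code>echo</code> gets error-prone. A small script can generate it instead. This is just a sketch using Python's standard library, reusing the key convention described above (workflow name, a dot, then input name) and the file names from this section:

```python
import json

# Build a WDL inputs file programmatically. Keys are
# "<workflow name>.<input name>"; file values are paths relative to the
# inputs file (absolute paths and URLs also work).
inputs = {"hello_caller.who": "./100_names.txt"}

with open("inputs_big.json", "w") as f:
    json.dump(inputs, f, indent=2)
```

Using <code>json.dump()</code> also guarantees correct quoting and escaping, which hand-written <code>echo</code> commands do not.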
==Writing your own workflow==

In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz].

===Writing the file===

All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement:

 version 1.0

Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>.

 version 1.0
 workflow FizzBuzz {
 }

Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section.

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
 }

Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run. Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable.
 Array[Int] numbers = range(item_count)

WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0.

 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
 }

Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases.
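Since a scatter works like a <code>map()</code> and <code>select_first()</code> takes the first non-null value, the logic we are about to build can be previewed in plain Python. This is only a conceptual sketch, not how any WDL runner is implemented; <code>None</code> stands in for WDL's <code>null</code>, and the input values are hypothetical (the workflow's defaults, with <code>item_count</code> set to 20):

```python
def select_first(values):
    # Rough analogue of WDL select_first(): the first non-null value wins.
    for v in values:
        if v is not None:
            return v
    raise ValueError("all values were null")

item_count, to_fizz, to_buzz = 20, 3, 5
fizzbuzz_override = None  # the optional input, left unset here

results = []
for i in range(item_count):  # the scatter body, run once per number
    one_based = i + 1
    # Variables in un-executed WDL conditionals are null; mimic with None.
    fizz = "Fizz" if one_based % to_fizz == 0 else None
    both = one_based % to_fizz == 0 and one_based % to_buzz == 0
    fizzbuzz = (fizzbuzz_override or "FizzBuzz") if both else None
    buzz = "Buzz" if one_based % to_buzz == 0 else None
    number_string = str(one_based) if (fizz is None and buzz is None) else None
    results.append(select_first([fizzbuzz, fizz, buzz, number_string]))
```

The WDL version of those special cases follows below; the priority order in the <code>select_first()</code> call is the same in both.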
 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
     if (one_based % to_fizz == 0) {
         String fizz = "Fizz"
         if (one_based % to_buzz == 0) {
             String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
         }
     }
     if (one_based % to_buzz == 0) {
         String buzz = "Buzz"
     }
     if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
         # Just a normal number.
     }
 }

Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access. Since we only make the call for plain numbers, the task's output will be <code>null</code> in the iterations where we produced a Fizz, a Buzz, or a FizzBuzz instead.

 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
     if (one_based % to_fizz == 0) {
         String fizz = "Fizz"
         if (one_based % to_buzz == 0) {
             String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
         }
     }
     if (one_based % to_buzz == 0) {
         String buzz = "Buzz"
     }
     if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
         # Just a normal number.
         call stringify_number {
             input:
                 the_number = one_based
         }
     }
     String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
 }

We can put the code into the workflow now, and set about writing the task.

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }

Our task should go after the workflow in the file. It looks a lot like a workflow, except it uses <code>task</code>.

 task stringify_number {
 }

We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill that in in <code>input</code> and <code>output</code> sections.

 task stringify_number {
     input {
         Int the_number
     }
     # ???
     output {
         String the_string # = ???
     }
 }

Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string # = ???
     }
 }

Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command.
We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
 }

We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file:

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }
 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1 and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task.

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
     output {
         Array[String] fizzbuzz_results = result
     }
 }
 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:

 echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json

Then run it on the cluster with Toil:

 mkdir -p logs
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

Or locally:

 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

==Debugging Workflows==

Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.

===Debugging Options===

When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place where you can access them.
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:

 =========> Toil job log is here <=========

Normally, only the logs of failing jobs are reproduced like this.

===Reading the Log===

When a WDL workflow fails, you are likely to see a message like this:

 WDL.runtime.error.CommandFailed: task command failed with exit status 1
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism

This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen either when the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error. Go up higher in the log until you find lines that look like:

 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt:

And

 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt:

These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.

===Reproducing Problems===

When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this.
In the log of your failing Toil task, look for lines like this:

 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'
 ...

The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk. So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at:

 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam

===More Ways of Finding Files===

Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try to find the files by name.
For example, if you want to look at <code>Sample.bam</code>, you can look for it like this:

 find /path/to/the/jobstore -name "Sample.bam"

If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:

 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam

You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:

 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam

Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found.

===Using Development Versions of Toil===

Sometimes, bugs will be fixed in the development version of Toil, but not released yet.
To try the current development version of Toil, you can install it like this:

 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'

If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do:

 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'

===Frequently Asked Questions===

====I am getting warnings about <code>XDG_RUNTIME_DIR</code>====

You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it.

====Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!====

Toil will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.
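One last debugging convenience: the URL-decoding step from the ''More Ways of Finding Files'' section above can be scripted instead of done in a web tool. This is a sketch using only Python's standard library, applied to the example <code>toilfile:</code> URI from that section:

```python
from urllib.parse import unquote

def toilfile_to_jobstore_path(uri):
    # Decode a percent-encoded toilfile: URI from a Toil debug log and
    # return the path, relative to the job store, where the file lives.
    decoded = unquote(uri)
    # The job-store-relative path is everything after the last colon.
    return decoded.rsplit(":", 1)[-1]

# The example URI from the "More Ways of Finding Files" section.
uri = ("toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob"
       "%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351"
       "%2FSample.chr14.bam/Sample.chr14.bam")
print(toilfile_to_jobstore_path(uri))
```

Joining the printed path onto your <code>--jobStore</code> directory gives the on-disk location of the file.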
==Additional WDL resources==

For more information on writing and running WDL workflows, see:

* [https://docs.openwdl.org/en/stable/ The WDL documentation]
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube]
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ===Connecting to Phoenix=== Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should *not* run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs. To connect to the head node: 1. Connect to the VPN. 2. SSH to <code>phoenix.prism</code>. At the command line, run: ssh phoenix.prism If your username on the cluster (say, <code>flastname</code>) is different than your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@phoenix.prism The first time you connect, you will see a message like: The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established. ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI. This key is not known by any other names Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is. ===Installing Toil with WDL support=== Once you are on the head node, you can install Toil, a program for running workflows. 
When installing, you need to specify that you want WDL support. To do this, you can run: pip install --upgrade --user 'toil[wdl]' If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively: pip install --upgrade --user 'toil[wdl,aws,google]' This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter *will not* look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run: echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc After that, **log out and log back in**, to restart bash and pick up the change. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above. ===Configuring Toil for Phoenix=== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. 
So, use these commands to make sure that Toil knows where it ought to put its caches: echo 'export SINGULARITY_CACHEDIR="${HOME}/.singularity/cache"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${HOME}/.cache/miniwdl"' >>~/.bashrc After that, '''log out and log back in again''', to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day], so it is important not to skip this step. ==Running an existing workflow== First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ===Preparing an input file=== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. 
Let's make it in <code>names.txt</code>:

 echo "Mridula Resurrección" >names.txt
 echo "Gershom Šarlota" >>names.txt
 echo "Ritchie Ravi" >>names.txt

Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this:

 echo '{"hello_caller.who": "./names.txt"}' >inputs.json

Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].

===Testing at small scale, single-machine===

We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:

 srun -c 2 --mem 8G --pty bash -i

This will start a new shell; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command:

 toil-wdl-runner self_test.wdl inputs.json -o local_run

This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>.
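If you ever generate inputs files from a script instead of by hand, the key convention is easy to automate. This Python sketch writes the same <code>inputs.json</code>; the <code>make_inputs</code> helper is ours for illustration, not part of Toil:

```python
import json

def make_inputs(workflow_name, inputs):
    # Keys in a WDL inputs file are "<workflow name>.<input name>".
    return {f"{workflow_name}.{name}": value for name, value in inputs.items()}

doc = make_inputs("hello_caller", {"who": "./names.txt"})
with open("inputs.json", "w") as f:
    json.dump(doc, f)
```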
This will print a lot of logging to standard error, and to standard output it will print:

 {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]}

The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>.

===Running at larger scale===

Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people! Go get this handy list of people and cut it to length:

 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt
 head -n100 1000_names.txt >100_names.txt

And make a new inputs file:

 echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json

Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create, where it can store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it.

 mkdir -p logs
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json

This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory.
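The <code>\u00f3</code>-style sequences in that JSON are ordinary string escapes, and any JSON parser will decode them for you. A quick check in Python, using a trimmed copy of the <code>hello_caller.messages</code> part of the output above:

```python
import json

# A trimmed copy of the JSON toil-wdl-runner printed to standard output.
printed = ('{"hello_caller.messages": ["Hello, Mridula Resurrecci\\u00f3n!", '
           '"Hello, Gershom \\u0160arlota!", "Hello, Ritchie Ravi!"]}')

# json.loads() turns escapes like \u00f3 back into the real characters.
outputs = json.loads(printed)
print(outputs["hello_caller.messages"][0])  # Hello, Mridula Resurrección!
```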
==Writing your own workflow==

In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz].

===Writing the file===

All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement:

 version 1.0

Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>.

 version 1.0
 workflow FizzBuzz {
 }

Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section.

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
 }

Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run. Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable.
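The required/defaulted/optional input rules behave roughly like this Python sketch; <code>resolve_inputs</code> is an illustrative helper of ours, not a real WDL or Toil API:

```python
def resolve_inputs(user_inputs):
    # Required input: no default and not optional, so it must be provided.
    if "item_count" not in user_inputs:
        raise ValueError("missing required input: item_count")
    return {
        "item_count": user_inputs["item_count"],
        "to_fizz": user_inputs.get("to_fizz", 3),   # has a default of 3
        "to_buzz": user_inputs.get("to_buzz", 5),   # has a default of 5
        # Optional (String?) input: null (None) when not provided.
        "fizzbuzz_override": user_inputs.get("fizzbuzz_override"),
    }

print(resolve_inputs({"item_count": 20}))
```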
 Array[Int] numbers = range(item_count)

WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0.

 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
 }

Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases.
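For intuition, here is what <code>select_first()</code> and the null-from-unexecuted-conditionals behavior look like in Python terms (a sketch, not Toil's implementation):

```python
def select_first(values):
    # WDL's select_first(): return the first non-null value in the array,
    # and it is an error if every value is null.
    for v in values:
        if v is not None:
            return v
    raise ValueError("all values were null")

# A variable declared inside a conditional that did not run reads as null.
fizz = "Fizz"    # this conditional's test held, so the variable was set
fizzbuzz = None  # this conditional's test failed, so the variable is null
print(select_first([fizzbuzz, fizz]))  # Fizz
```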
 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
     if (one_based % to_fizz == 0) {
         String fizz = "Fizz"
         if (one_based % to_buzz == 0) {
             String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
         }
     }
     if (one_based % to_buzz == 0) {
         String buzz = "Buzz"
     }
     if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
         # Just a normal number.
     }
 }

Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, but only if we didn't make a noise instead.

 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
     if (one_based % to_fizz == 0) {
         String fizz = "Fizz"
         if (one_based % to_buzz == 0) {
             String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
         }
     }
     if (one_based % to_buzz == 0) {
         String buzz = "Buzz"
     }
     if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
         # Just a normal number.
         call stringify_number {
             input:
                 the_number = one_based
         }
     }
     String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
 }

We can put the code into the workflow now, and set about writing the task.

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }

Our task should go after the workflow in the file. It looks a lot like a workflow, except it uses <code>task</code>.

 task stringify_number {
 }

We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill that in in <code>input</code> and <code>output</code> sections.

 task stringify_number {
     input {
         Int the_number
     }
     # ???
     output {
         String the_string # = ???
     }
 }

Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string # = ???
     }
 }

Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command.
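The capture pattern the task uses — run a Bash command script with the value substituted in, then read its standard output back — can be mimicked in plain Python. This is a rough analogue for intuition, not how Toil actually executes tasks:

```python
import subprocess

the_number = 7
# Analogue of the task's command section: a Bash script with ~{} filled in.
script = f"set -e\necho {the_number}\n"
result = subprocess.run(["bash", "-c", script],
                        capture_output=True, text=True, check=True)
# Read the captured standard output back, without the trailing newline.
the_string = result.stdout.rstrip("\n")
print(the_string)  # 7
```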
We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
 }

We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file:

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }
 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task.

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
     output {
         Array[String] fizzbuzz_results = result
     }
 }
 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:

 echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json

Then run it on the cluster with Toil:

 mkdir -p logs
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

Or locally:

 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

==Debugging Workflows==

Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.

===Debugging Options===

When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code>, so that the stored files shipped between jobs are in a place you can access them.
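To sanity-check the workflow's results, it can help to have a plain Python rendition of the same logic (same defaults, same one-based numbering, same <code>select_first</code> ordering); note this uses <code>or</code> as a loose stand-in for null-checking:

```python
def fizzbuzz(item_count, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    # Mirrors the workflow: scatter over range(item_count), one-based numbers.
    results = []
    for i in range(item_count):
        n = i + 1
        fizz = "Fizz" if n % to_fizz == 0 else None
        buzz = "Buzz" if n % to_buzz == 0 else None
        both = None
        if fizz and buzz:
            # select_first([fizzbuzz_override, "FizzBuzz"])
            both = fizzbuzz_override or "FizzBuzz"
        # select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
        results.append(both or fizz or buzz or str(n))
    return results

print(fizzbuzz(20))
```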
When debug logging is on, the log from every Toil job is inserted in the main Toil log between markers like this:

 =========> Toil job log is here <=========

Normally, only the logs of failing jobs are reproduced like this.

===Reading the Log===

When a WDL workflow fails, you are likely to see a message like this:

 WDL.runtime.error.CommandFailed: task command failed with exit status 1
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism

This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen either when the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error. Go up higher in the log until you find lines that look like:

 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt:

and

 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt:

These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.

===Reproducing Problems===

When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this.
In the log of your failing Toil task, look for lines like this:

 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'
 ...

The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk. So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at:

 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam

===More Ways of Finding Files===

Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name.
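Assembling that on-disk location is just a path join of your <code>--jobStore</code> value and the file ID. For example, in Python, using the paths from the log excerpt above:

```python
import os.path

# Example values taken from the log excerpt and --jobStore flag above.
job_store = "/private/groups/patenlab/anovak/jobstore"
file_id = ("files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/"
           "file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam")

# The file ID is a path relative to the job store directory.
full_path = os.path.join(job_store, file_id)
print(full_path)
```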
For example, if you want to look at <code>Sample.bam</code>, you can look for it like this:

 find /path/to/the/jobstore -name "Sample.bam"

If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:

 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam

You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/ an online URL decoder], getting this:

 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam

Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path, relative to the job store, where this file can be found.

===Using Development Versions of Toil===

Sometimes, bugs will be fixed in the development version of Toil, but not released yet.
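You can also do the decoding without leaving the cluster, for example with Python's standard library (using the <code>toilfile:</code> URI from the log line above):

```python
from urllib.parse import unquote

# The toilfile: URI copied from the "Virtualized ... as WDL file" log line.
uri = ("toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob"
       "%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351"
       "%2FSample.chr14.bam/Sample.chr14.bam")

# Undo the %3A / %2F percent-encoding.
decoded = unquote(uri)
# The job-store-relative path is everything after the last colon.
relative_path = decoded.rsplit(":", 1)[1]
print(relative_path)
```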
To try the current development version of Toil, you can install it like this:

 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'

If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do:

 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'

===Frequently Asked Questions===

====I am getting warnings about <code>XDG_RUNTIME_DIR</code>====

You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it.

====Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!====

Toil will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.
==Additional WDL resources==

For more information on writing and running WDL workflows, see:

* [https://docs.openwdl.org/en/stable/ The WDL documentation]
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube]
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]
Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, but only if the call actually ran; for the numbers where we produced a word instead, its output will be <code>null</code>. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? 
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill those in using <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command.
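In plain shell terms, what happens inside the task is that the <code>~{}</code> substitution is resolved to a literal value before Bash runs, and the command's standard output is captured to a file that <code>stdout()</code> hands back. A rough shell analogue (the <code>stdout.txt</code> name is illustrative only, not Toil's real capture path):

```shell
# Rough analogue of the task for the_number = 7: the ~{the_number}
# substitution becomes a literal value before bash runs the script,
# and standard output is captured to a file.
the_number=7
set -e
echo ${the_number} > stdout.txt
# stdout() would hand this captured file back to the output section.
cat stdout.txt
```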
We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We'll also tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? 
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? 
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as an array. Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json ==Debugging Workflows== Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong. ===Debugging Options=== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place you can access them.
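Before reaching for debugging tools, it helps to know what a correct run should have produced. For the <code>item_count = 20</code> run above, an independent mirror of the FizzBuzz logic (plain python, not part of the workflow) gives the expected contents of <code>fizzbuzz_results</code>:

```shell
# Independent re-implementation of the workflow's logic, to know what
# FizzBuzz.fizzbuzz_results should contain for item_count = 20.
python3 -c '
to_fizz, to_buzz = 3, 5
results = []
for i in range(20):          # range() is 0-based, like WDL
    n = i + 1                # one_based
    if n % to_fizz == 0 and n % to_buzz == 0:
        results.append("FizzBuzz")
    elif n % to_fizz == 0:
        results.append("Fizz")
    elif n % to_buzz == 0:
        results.append("Buzz")
    else:
        results.append(str(n))
print(" ".join(results))
'
```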
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs are reproduced like this. ===Reading the Log=== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen when either the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error. Go up higher in the log until you find lines that look like: [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt: And [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt: These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. ===Reproducing Problems=== When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. 
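The markers described above make it easy to pull an individual job's log out of a saved copy of the main log. A minimal sketch, assuming the markers appear on their own lines and that you saved the run's output to a file (<code>toil.log</code> here is a faked fragment so the sketch runs anywhere):

```shell
# Fake a fragment of a debug-level main Toil log, then extract just
# the job log between the markers with awk.
cat > toil.log <<'EOF'
[2023-07-16T16:23:50-0700] unrelated leader logging
=========>
the job log lines appear here
<=========
[2023-07-16T16:23:54-0700] more leader logging
EOF
awk '/=========>/,/<=========/' toil.log
```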
In the log of your failing Toil task, look for lines like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk. So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ===More Ways of Finding Files=== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try to find the files by name.
For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ===Using Development Versions of Toil=== Sometimes, bugs will be fixed in the development version of Toil, but not released yet. 
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ===Frequently Asked Questions=== ====I am getting warnings about <code>XDG_RUNTIME_DIR</code>==== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it. ====Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!==== Toil will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. 
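The file-finding recipes from the debugging sections above can be combined into one small helper. A sketch (the helper name is made up, and the job store path and URI are the example values from earlier):

```shell
# Hypothetical helper: URL-decode a toilfile: URI copied from the log
# and turn it into an on-disk path under the job store. The decoded
# relative path is everything after the URI's last colon.
toilfile_to_path() {
    python3 -c '
import sys
from urllib.parse import unquote
jobstore, uri = sys.argv[1], sys.argv[2]
print(jobstore.rstrip("/") + "/" + unquote(uri).rsplit(":", 1)[-1])
' "$1" "$2"
}
toilfile_to_path /private/groups/patenlab/anovak/jobstore \
    'toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam'
```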
==Additional WDL resources== For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL documentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification] =Tutorial: Getting Started with WDL Workflows on Phoenix= Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like the Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows on the Genomics Institute's Phoenix cluster. By the end, you will be able to run workflows on Phoenix with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong. ==Setup== Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH. ===Getting VPN access=== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ===Connecting to Phoenix=== Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should '''not''' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs. To connect to the head node: 1. Connect to the VPN. 2. SSH to <code>phoenix.prism</code>. At the command line, run: ssh phoenix.prism If your username on the cluster (say, <code>flastname</code>) is different from your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@phoenix.prism The first time you connect, you will see a message like: The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established. ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI. This key is not known by any other names Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is. ===Installing Toil with WDL support=== Once you are on the head node, you can install Toil, a program for running workflows.
When installing, you need to specify that you want WDL support. To do this, you can run: pip install --upgrade --user 'toil[wdl]' If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively: pip install --upgrade --user 'toil[wdl,aws,google]' This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter ''will not'' look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run: echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc After that, '''log out and log back in''', to restart bash and pick up the change. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above.
So, use these commands to make sure that Toil knows where it ought to put its caches: echo 'export SINGULARITY_CACHEDIR="${HOME}/.singularity/cache"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${HOME}/.cache/miniwdl"' >>~/.bashrc After that, '''log out and log back in again''', to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day], so it is important not to skip this step. ==Running an existing workflow== First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ===Preparing an input file=== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. 
Let's make it in <code>names.txt</code> echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an *inputs file*, which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ===Testing at small scale single-machine=== We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --pty bash -i This will start a new shell; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. 
This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ===Running at larger scale=== Back on the head node, let's prepare to run a larger run. Greeting 3 people isn't cool, let's greet one hundred people! Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. mkdir -p logs toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. 
==Writing your own workflow== In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ===Writing the file=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run. Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. 
Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } Inside the body of the scatter, we are going to put some conditionals to determine whether we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we build an array containing it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else, you will have to check the negated condition. So first, let's handle the special cases.
Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access; the call's outputs are <code>null</code> for iterations where the call didn't run (i.e. where we made a noise instead). Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill those in with <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command.
We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as an array. Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json ==Debugging Workflows== Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong. ===Debugging Options=== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place you can access them.
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs are reproduced like this. ===Reading the Log=== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen either when the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error. Go up higher in the log until you find lines that look like: [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt: And [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt: These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. ===Reproducing Problems=== When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this.
In the log of your failing Toil task, look for lines like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk. So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ===More Ways of Finding Files=== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try to find the files by name.
For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ===Using Development Versions of Toil=== Sometimes, bugs will be fixed in the development version of Toil, but not released yet. 
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ==Frequently Asked Questions== ===I am getting warnings about <code>XDG_RUNTIME_DIR</code>=== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it. ===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!=== Toil will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. 
==Additional WDL resources== For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL documentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification] '''Tutorial: Getting Started with WDL Workflows on Phoenix''' Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like the Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows on the Genomics Institute's Phoenix cluster. By the end, you will be able to run workflows on Phoenix with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong. =Setup= Before we begin, you will need a computer to work from, on which you can install software, and the ability to connect to other machines over SSH. ==Getting VPN access== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to reach the cluster, you need access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ==Connecting to Phoenix== Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should '''not''' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs. To connect to the head node: 1. Connect to the VPN. 2. SSH to <code>phoenix.prism</code>. At the command line, run: ssh phoenix.prism If your username on the cluster (say, <code>flastname</code>) is different from your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@phoenix.prism The first time you connect, you will see a message like: The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established. ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI. This key is not known by any other names Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is. ==Installing Toil with WDL support== Once you are on the head node, you can install Toil, a program for running workflows.
When installing, you need to specify that you want WDL support. To do this, you can run: pip install --upgrade --user 'toil[wdl]' If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively: pip install --upgrade --user 'toil[wdl,aws,google]' This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter '''will not''' look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run: echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc After that, '''log out and log back in''', to restart bash and pick up the change. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above. ==Configuring Toil for Phoenix== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps.
So, use these commands to make sure that Toil knows where it ought to put its caches: echo 'export SINGULARITY_CACHEDIR="${HOME}/.singularity/cache"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${HOME}/.cache/miniwdl"' >>~/.bashrc After that, '''log out and log back in again''', to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day], so it is important not to skip this step. =Running an existing workflow= First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ==Preparing an input file== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code>: echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow.
(You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ==Testing at small scale on a single machine== We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --pty bash -i This will start a new shell; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>.
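The outputs that <code>toil-wdl-runner</code> reports (printed to standard output, or saved to a file with the <code>-m</code> option) are ordinary JSON keyed by <code>workflow.output</code> names, so they are easy to post-process. A small sketch, using the self-test workflow's output shape (the example values here are from this tutorial's run, not anything you must reproduce exactly):

```python
import json

# Example outputs JSON, shaped like what toil-wdl-runner prints for the
# self-test workflow; keys are "<workflow name>.<output name>".
outputs_json = '{"hello_caller.messages": ["Hello, Mridula Resurrecci\\u00f3n!", "Hello, Gershom \\u0160arlota!", "Hello, Ritchie Ravi!"]}'

outputs = json.loads(outputs_json)  # \uXXXX escapes decode to real characters
for message in outputs["hello_caller.messages"]:
    print(message)
```

In a real run you would read the JSON from the file you passed to <code>-m</code> instead of from a string literal.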
To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ==Running at larger scale== Back on the head node, let's prepare to run a larger run. Greeting 3 people isn't cool, let's greet one hundred people! Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. mkdir -p logs toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. =Writing your own workflow= In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ==Writing the file== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. 
So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run. Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. 
Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. 
To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, but only for the numbers where we don't make a noise instead.

 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
     if (one_based % to_fizz == 0) {
         String fizz = "Fizz"
         if (one_based % to_buzz == 0) {
             String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
         }
     }
     if (one_based % to_buzz == 0) {
         String buzz = "Buzz"
     }
     if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
         # Just a normal number.
         call stringify_number {
             input:
                 the_number = one_based
         }
     }
     String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
 }

We can put the code into the workflow now, and set about writing the task.

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }

Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>.

 task stringify_number {
 }

We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>.
So let's fill those in, with <code>input</code> and <code>output</code> sections.

 task stringify_number {
     input {
         Int the_number
     }
     # ???
     output {
         String the_string # = ???
     }
 }

Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string # = ???
     }
 }

Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
 }

We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.
 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage.

Then we can put our task into our WDL file:

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }
 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

Now the only thing missing is a workflow-level <code>output</code> section.
Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task.

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
     output {
         Array[String] fizzbuzz_results = result
     }
 }
 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array.

==Running the Workflow==

Now all that remains is to run the workflow!
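(If that gathering rule is hard to picture, here is a tiny, purely illustrative Python analogy of how a per-iteration <code>String</code> declared inside a scatter is seen as an <code>Array[String]</code> outside of it; this is just for intuition, not part of the workflow.)

```python
# Array[Int] numbers = range(5)
numbers = range(5)

# Inside the scatter, `result` names a single String for each iteration;
# a list comprehension plays the same role here: one value per input.
result = [str(i + 1) for i in numbers]

# Outside the scatter, the same name refers to the gathered array.
print(result)
```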
As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json =Debugging Workflows= Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong. ==Debugging Options== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs are reproduced like this. ==Reading the Log== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen when either the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error. 
Go up higher in the log until you find lines that look like:

 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt:

And

 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt:

These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.

==Reproducing Problems==

When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. In the log of your failing Toil task, look for lines like this:

 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'
 ...

The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk.
So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ==More Ways of Finding Files== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name. For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ==Using Development Versions of Toil== Sometimes, bugs will be fixed in the development version of Toil, but not released yet. 
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ==Frequently Asked Questions== ===I am getting warnings about <code>XDG_RUNTIME_DIR</code>=== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it. ===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!=== Toil will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. 
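As one more debugging convenience: the URL-decoding of <code>toilfile:</code> URIs described in "More Ways of Finding Files" above can be done with a few lines of Python instead of a web tool. A sketch, using the example URI from that section (the job-store-relative path is the part after the last colon of the decoded URI):

```python
from urllib.parse import unquote

# The example toilfile: URI from the log excerpt above.
uri = ("toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob"
       "%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351"
       "%2FSample.chr14.bam/Sample.chr14.bam")

# Undo the percent-encoding (%3A -> ":", %2F -> "/").
decoded = unquote(uri)

# The part after the last colon is the path relative to the job store.
relative_path = decoded.rsplit(":", 1)[1]
print(relative_path)
```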
=Additional WDL resources=

For more information on writing and running WDL workflows, see:

* [https://docs.openwdl.org/en/stable/ The WDL documentation]
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube]
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]

b32acae3cbf5ff88ec9f812a747e024bd9ec420d 420 419 2023-08-10T15:21:13Z Anovak 4 /* Writing the file */ wikitext text/x-wiki '''Tutorial: Getting Started with WDL Workflows on Phoenix'''

Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows on the Genomics Institute's Phoenix cluster. By the end, you will be able to run workflows on Phoenix with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.

=Setup=

Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.

==Getting VPN access==

We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.

==Connecting to Phoenix==

Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should ''not'' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs. To connect to the head node:

1. Connect to the VPN.

2. SSH to <code>phoenix.prism</code>. At the command line, run:

 ssh phoenix.prism

If your username on the cluster (say, <code>flastname</code>) is different than your username on your computer (which might be <code>firstname</code>), you might instead have to run:

 ssh flastname@phoenix.prism

The first time you connect, you will see a message like:

 The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established.
 ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI.
 This key is not known by any other names
 Are you sure you want to continue connecting (yes/no/[fingerprint])?

This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is.

==Installing Toil with WDL support==

Once you are on the head node, you can install Toil, a program for running workflows.
When installing, you need to specify that you want WDL support. To do this, you can run:

 pip install --upgrade --user 'toil[wdl]'

If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively:

 pip install --upgrade --user 'toil[wdl,aws,google]'

This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter ''will not'' look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:

 echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc

After that, '''log out and log back in''', to restart bash and pick up the change. To make sure it worked, you can run:

 toil-wdl-runner --help

If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above.

==Configuring Toil for Phoenix==

Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps.
So, use these commands to make sure that Toil knows where it ought to put its caches:

 echo 'export SINGULARITY_CACHEDIR="${HOME}/.singularity/cache"' >>~/.bashrc
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${HOME}/.cache/miniwdl"' >>~/.bashrc

After that, '''log out and log back in again''', to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day], so it is important not to skip this step.

=Running an existing workflow=

First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.

 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl

==Preparing an input file==

Near the top of the WDL file, there's a section like this:

 workflow hello_caller {
     input {
         File who
     }

This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code>:

 echo "Mridula Resurrección" >names.txt
 echo "Gershom Šarlota" >>names.txt
 echo "Ritchie Ravi" >>names.txt

Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow.
(You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this:

 echo '{"hello_caller.who": "./names.txt"}' >inputs.json

Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].

==Testing at small scale on a single machine==

We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:

 srun -c 2 --mem 8G --pty bash -i

This will start a new shell; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command:

 toil-wdl-runner self_test.wdl inputs.json -o local_run

This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print:

 {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]}

The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person.
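Since inputs files are plain JSON, you can also generate them with a script instead of <code>echo</code>, which becomes handy for workflows with many inputs. A minimal Python sketch, using the same workflow and input names as above:

```python
import json

# Keys are "<workflow name>.<input name>"; File inputs are path strings,
# with relative paths resolved against the inputs file's location.
inputs = {"hello_caller.who": "./names.txt"}

with open("inputs.json", "w") as inputs_file:
    json.dump(inputs, inputs_file)

print(json.dumps(inputs))
```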
To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ==Running at larger scale== Back on the head node, let's prepare to run a larger run. Greeting 3 people isn't cool, let's greet one hundred people! Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. mkdir -p logs toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. =Writing your own workflow= In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ==Writing the file== ===Version=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. 
So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 ===Workflow Block=== Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } ===Input Block=== Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run. ===Body=== Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. ===Scattering=== Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. 
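The "like <code>map()</code> in Python, but parallel" analogy can be made concrete. Here is a purely illustrative Python sketch of how a runner treats a scatter body; this is not how Toil is actually implemented, just a model of the semantics:

```python
from concurrent.futures import ThreadPoolExecutor

def scatter_body(i):
    # Whatever declarations appear inside the scatter run once per value.
    one_based = i + 1
    return one_based

numbers = range(10)
with ThreadPoolExecutor() as pool:
    # Like a scatter: every iteration is independent, may run in parallel,
    # and the results are gathered back in input order.
    one_based = list(pool.map(scatter_body, numbers))

print(one_based)
```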
We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } ===Conditionals=== Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } ===Calling Tasks=== Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. 
Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, but only for the numbers where we don't make a noise instead.

 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
     if (one_based % to_fizz == 0) {
         String fizz = "Fizz"
         if (one_based % to_buzz == 0) {
             String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
         }
     }
     if (one_based % to_buzz == 0) {
         String buzz = "Buzz"
     }
     if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
         # Just a normal number.
         call stringify_number {
             input:
                 the_number = one_based
         }
     }
     String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
 }

We can put the code into the workflow now, and set about writing the task.

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }

===Writing Tasks===

Our task should go after the workflow in the file.
It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill that in in <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. 
In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, and can optionally request that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script.
# So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } ===Output Block=== Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need one, but you do need one in 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script.
# So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. ==Running the Workflow== Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json =Debugging Workflows= Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong. ==Debugging Options== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs are reproduced like this. ==Reading the Log== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. 
nonzero) exit code, which happens either when the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error. Go up higher in the log until you find lines that look like: [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt: And [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt: These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. ==Reproducing Problems== When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. In the log of your failing Toil task, look for lines like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk.
So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ==More Ways of Finding Files== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name. For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ==Using Development Versions of Toil== Sometimes, bugs will be fixed in the development version of Toil, but not released yet. 
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ==Frequently Asked Questions== ===I am getting warnings about <code>XDG_RUNTIME_DIR</code>=== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it. ===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!=== Toil will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. 
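If you need to apply the URL-decoding procedure from the ''More Ways of Finding Files'' section above to many files, it can be scripted instead of pasted into a web decoder. Here is a minimal Python sketch of the decode-and-split steps described there (the function name <code>toilfile_to_jobstore_path</code> is made up for illustration):

```python
from urllib.parse import unquote

def toilfile_to_jobstore_path(uri):
    # Undo the percent-encoding (%3A -> ':', %2F -> '/'), then take
    # everything after the last colon, which is the path relative to
    # the --jobStore directory where the file is stored.
    return unquote(uri).rsplit(":", 1)[1]

uri = ("toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob"
       "%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351"
       "%2FSample.chr14.bam/Sample.chr14.bam")
print(toilfile_to_jobstore_path(uri))
```

Joining that printed relative path onto your <code>--jobStore</code> directory gives the on-disk location of the file.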
=Additional WDL resources= For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL documentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification] '''Tutorial: Getting Started with WDL Workflows on Phoenix''' Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows on the Genomics Institute's Phoenix cluster. By the end, you will be able to run workflows on Phoenix with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong. =Setup= Before we begin, you will need a computer to work at that you can install software on, and the ability to connect to other machines over SSH. ==Getting VPN access== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ==Connecting to Phoenix== Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should ''not'' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs. To connect to the head node: 1. Connect to the VPN. 2. SSH to <code>phoenix.prism</code>. At the command line, run: ssh phoenix.prism If your username on the cluster (say, <code>flastname</code>) is different from your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@phoenix.prism The first time you connect, you will see a message like: The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established. ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI. This key is not known by any other names Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is. ==Installing Toil with WDL support== Once you are on the head node, you can install Toil, a program for running workflows.
When installing, you need to specify that you want WDL support. To do this, you can run: pip install --upgrade --user 'toil[wdl]' If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively: pip install --upgrade --user 'toil[wdl,aws,google]' This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter ''will not'' look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run: echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc After that, '''log out and log back in''', to restart bash and pick up the change. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above. ==Configuring Toil for Phoenix== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps.
So, use these commands to make sure that Toil knows where it ought to put its caches: echo 'export SINGULARITY_CACHEDIR="${HOME}/.singularity/cache"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${HOME}/.cache/miniwdl"' >>~/.bashrc After that, '''log out and log back in again''', to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day], so it is important not to skip this step. =Running an existing workflow= First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ==Preparing an input file== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code>: echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow.
(You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ==Testing at small scale on a single machine== We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --pty bash -i This will start a new shell; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person.
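The <code>\u00f3</code>-style sequences in the printed JSON are standard JSON Unicode escapes, so any JSON parser will restore the original characters. A quick, purely illustrative check in Python:

```python
import json

# toil-wdl-runner escapes non-ASCII characters in its output JSON;
# parsing the JSON turns them back into the original characters.
outputs = json.loads('{"hello_caller.messages": ["Hello, Gershom \\u0160arlota!"]}')
print(outputs["hello_caller.messages"][0])  # Hello, Gershom Šarlota!
```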
To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ==Running at larger scale== Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people! Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. mkdir -p logs toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. =Writing your own workflow= In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ==Writing the file== ===Version=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0.
So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 ===Workflow Block=== Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } ===Input Block=== Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run. ===Body=== Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. ===Scattering=== Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. 
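A rough Python analogy may help (illustrative only, and glossing over the fact that WDL runs the iterations in parallel): the scatter body plays the role of the expression applied to each element of the array.

```python
item_count = 5

# WDL: Array[Int] numbers = range(item_count)
numbers = range(item_count)

# WDL: scatter (i in numbers) { Int one_based = i + 1 }
one_based = [i + 1 for i in numbers]

print(one_based)  # [1, 2, 3, 4, 5]
```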
We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } ===Conditionals=== Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } ===Calling Tasks=== Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. 
Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with </code>.</code> access, only if we don't make a noise instead. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string] } } ===Writing Tasks=== Our task should go after the workflow in the file. 
It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill that in in <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. 
In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file:

 version 1.0
 
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }
 
 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

===Output Block===
Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task.

 version 1.0
 
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
     output {
         Array[String] fizzbuzz_results = result
     }
 }
 
 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array.

==Running the Workflow==
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:

 echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json

Then run it on the cluster with Toil:

 mkdir -p logs
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

Or locally:

 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

=Debugging Workflows=
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.

==Debugging Options==
When debugging a workflow, make sure to run it with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code>, so that the stored files shipped between jobs are in a place where you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between markers like this:

 =========> Toil job log is here <=========

Normally, only the logs of failing jobs are reproduced like this.

==Reading the Log==
When a WDL workflow fails, you are likely to see a message like this:

 WDL.runtime.error.CommandFailed: task command failed with exit status 1
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism

This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen either when the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error. Go up higher in the log until you find lines that look like:

 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt:

and

 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt:

These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.

==Reproducing Problems==
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. In the log of your failing Toil task, look for lines like this:

 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'
 ...

The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk.
So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at:

 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam

==More Ways of Finding Files==
Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name. For example, if you want to look at <code>Sample.bam</code>, you can look for it like this:

 find /path/to/the/jobstore -name "Sample.bam"

If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:

 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam

You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:

 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam

Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path, relative to the job store, where this file can be found.

==Using Development Versions of Toil==
Sometimes, bugs will be fixed in the development version of Toil, but not released yet. To try the current development version of Toil, you can install it like this:

 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'

If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do:

 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'

==Frequently Asked Questions==
===I am getting warnings about <code>XDG_RUNTIME_DIR</code>===
You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it.

===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!===
Toil will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.
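If you'd rather not paste URIs into a website, the URL-decoding step from the "More Ways of Finding Files" section above can also be done with Python's standard library. This is a minimal sketch using the example <code>toilfile:</code> URI from that section:

```python
from urllib.parse import unquote

# Example toilfile URI as it appears in the debug log (from the section above).
uri = ("toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob"
       "%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351"
       "%2FSample.chr14.bam/Sample.chr14.bam")

# Undo the percent-encoding (%3A -> ':', %2F -> '/').
decoded = unquote(uri)

# The job-store-relative path is the part after the last colon.
rel_path = decoded.rsplit(":", 1)[-1]
print(rel_path)
```

The same decoding works for any <code>toilfile:</code> URI copied out of the log.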
=Additional WDL resources=
For more information on writing and running WDL workflows, see:

* [https://docs.openwdl.org/en/stable/ The WDL documentation]
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube]
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]

'''Tutorial: Getting Started with WDL Workflows on Phoenix'''

Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows on the Genomics Institute's Phoenix cluster. By the end, you will be able to run workflows on Phoenix with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.

=Setup=
Before we begin, you will need a computer to work at, on which you are able to install software, and the ability to connect to other machines over SSH.

==Getting VPN access==
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives.
So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.

==Connecting to Phoenix==
Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should '''not''' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs. To connect to the head node:

# Connect to the VPN.
# SSH to <code>phoenix.prism</code>.

At the command line, run:

 ssh phoenix.prism

If your username on the cluster (say, <code>flastname</code>) is different than your username on your computer (which might be <code>firstname</code>), you might instead have to run:

 ssh flastname@phoenix.prism

The first time you connect, you will see a message like:

 The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established.
 ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI.
 This key is not known by any other names
 Are you sure you want to continue connecting (yes/no/[fingerprint])?

This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is.
==Installing Toil with WDL support==
Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run:

 pip install --upgrade --user 'toil[wdl]'

If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively:

 pip install --upgrade --user 'toil[wdl,aws,google]'

This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter '''will not''' look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:

 echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc

After that, '''log out and log back in''', to restart bash and pick up the change. To make sure it worked, you can run:

 toil-wdl-runner --help

If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above.

==Configuring Toil for Phoenix==
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps.
So, use these commands to make sure that Toil knows where it ought to put its caches:

 echo 'export SINGULARITY_CACHEDIR="${HOME}/.singularity/cache"' >>~/.bashrc
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${HOME}/.cache/miniwdl"' >>~/.bashrc

After that, '''log out and log back in again''', to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day], so it is important not to skip this step.

=Running an existing workflow=
First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.

 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl

==Preparing an input file==
Near the top of the WDL file, there's a section like this:

 workflow hello_caller {
     input {
         File who
     }

This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code>:

 echo "Mridula Resurrección" >names.txt
 echo "Gershom Šarlota" >>names.txt
 echo "Ritchie Ravi" >>names.txt

Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow.
(You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this:

 echo '{"hello_caller.who": "./names.txt"}' >inputs.json

Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].

==Testing at small scale single-machine==
We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:

 srun -c 2 --mem 8G --pty bash -i

This will start a new shell; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command:

 toil-wdl-runner self_test.wdl inputs.json -o local_run

This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print:

 {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]}

The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person.
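Those <code>\u00f3</code> sequences are standard JSON string escapes, so any JSON parser will give you back the real characters. A quick illustration with Python's standard library, using a trimmed-down piece of the output above:

```python
import json

# A trimmed-down fragment of the workflow's output JSON, with an ASCII-safe escape.
output_json = '{"hello_caller.messages": ["Hello, Mridula Resurrecci\\u00f3n!"]}'

# Parsing turns \u00f3 back into the actual character.
outputs = json.loads(output_json)
print(outputs["hello_caller.messages"][0])  # Hello, Mridula Resurrección!
```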
To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ==Running at larger scale== Back on the head node, let's prepare to run a larger run. Greeting 3 people isn't cool, let's greet one hundred people! Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. mkdir -p logs toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. =Writing your own workflow= In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ==Writing the file== ===Version=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. 
So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 ===Workflow Block=== Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } ===Input Block=== Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run. ===Body=== Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. ===Scattering=== Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. 
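For readers who know Python, here is a rough analogy (plain Python, not WDL, and only an approximation, since a real WDL runner can execute the scatter body in parallel): a scatter is like a list comprehension whose per-element results are collected back into arrays.

```python
# Rough Python analogy for a WDL scatter; illustration only, not real WDL semantics.
item_count = 4
numbers = list(range(item_count))     # WDL: Array[Int] numbers = range(item_count)

# WDL: scatter (i in numbers) { Int one_based = i + 1 }
one_based = [i + 1 for i in numbers]  # seen from outside the scatter, an Array[Int]

print(one_based)  # [1, 2, 3, 4]
```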
We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } ===Conditionals=== Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } ===Calling Tasks=== Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. 
Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, but only for the numbers where we didn't make a "Fizz", "Buzz", or "FizzBuzz" noise instead.

 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
     if (one_based % to_fizz == 0) {
         String fizz = "Fizz"
         if (one_based % to_buzz == 0) {
             String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
         }
     }
     if (one_based % to_buzz == 0) {
         String buzz = "Buzz"
     }
     if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
         # Just a normal number.
         call stringify_number {
             input:
                 the_number = one_based
         }
     }
     String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
 }

We can put the code into the workflow now, and set about writing the task.

 version 1.0
 
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }

===Writing Tasks===
Our task should go after the workflow in the file.
It looks a lot like a workflow, except it uses <code>task</code>.

 task stringify_number {
 }

We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill that in, in <code>input</code> and <code>output</code> sections.

 task stringify_number {
     input {
         Int the_number
     }
     # ???
     output {
         String the_string # = ???
     }
 }

Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string # = ???
     }
 }

Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
 }

We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command.
In a real workflow, you probably want to set up optiopnal inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string] } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. 
# So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } ===Output Block=== Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1 and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script.
# So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. ==Running the Workflow== Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json =Debugging Workflows= Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong. ==Debugging Options== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place you can access. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs are reproduced like this. ==Reading the Log== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command line command specified by one of your WDL tasks exited with a failing (i.e.
nonzero) exit code, which will happen when either the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error. Go up higher in the log until you find lines that look like: [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt: And [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt: These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. ==Reproducing Problems== When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. In the log of your failing Toil task, look for lines like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk.
So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ==More Ways of Finding Files== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name. For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ==Using Development Versions of Toil== Sometimes, bugs will be fixed in the development version of Toil, but not released yet. 
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ==Frequently Asked Questions== ===I am getting warnings about <code>XDG_RUNTIME_DIR</code>=== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it. ===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!=== The Toil worker process for each job will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. 
=Additional WDL resources= For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL documentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification] f5d095505d0d678d3f7e43470edf6b592d0b531e 423 422 2023-08-10T15:27:19Z Anovak 4 wikitext text/x-wiki '''Tutorial: Getting Started with WDL Workflows on Phoenix''' Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong. =Phoenix Cluster Setup= Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH. ==Getting VPN access== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives.
So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ==Connecting to Phoenix== Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should *not* run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs. To connect to the head node: 1. Connect to the VPN. 2. SSH to <code>phoenix.prism</code>. At the command line, run: ssh phoenix.prism If your username on the cluster (say, <code>flastname</code>) is different than your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@phoenix.prism The first time you connect, you will see a message like: The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established. ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI. This key is not known by any other names Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is. 
==Installing Toil with WDL support== Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run: pip install --upgrade --user 'toil[wdl]' If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively: pip install --upgrade --user 'toil[wdl,aws,google]' This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter *will not* look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run: echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc After that, **log out and log back in**, to restart bash and pick up the change. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above. ==Configuring Toil for Phoenix== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. 
So, use these commands to make sure that Toil knows where it ought to put its caches: echo 'export SINGULARITY_CACHEDIR="${HOME}/.singularity/cache"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${HOME}/.cache/miniwdl"' >>~/.bashrc After that, '''log out and log back in again''', to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day], so it is important not to skip this step. =Running an existing workflow= First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ==Preparing an input file== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code> echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an *inputs file*, which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. 
(You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ==Testing at small scale on a single machine== We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --pty bash -i This will start a new shell; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person.
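Those <code>\u00f3</code>-style sequences are ordinary JSON escape sequences for non-ASCII characters, so any JSON parser will restore the real characters for you. As a quick check, here is a small Python sketch using an abbreviated piece of the output above:

```python
import json

# Toil prints its output JSON with non-ASCII characters escaped;
# parsing it restores the original text.
printed = '{"hello_caller.messages": ["Hello, Mridula Resurrecci\\u00f3n!"]}'
decoded = json.loads(printed)
print(decoded["hello_caller.messages"][0])  # Hello, Mridula Resurrección!
```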
To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ==Running at larger scale== Back on the head node, let's prepare to run a larger run. Greeting 3 people isn't cool, let's greet one hundred people! Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. mkdir -p logs toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. =Writing your own workflow= In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ==Writing the file== ===Version=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. 
So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 ===Workflow Block=== Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } ===Input Block=== Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run. ===Body=== Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. ===Scattering=== Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. 
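The <code>map()</code> analogy can be made concrete with a rough Python sketch of the semantics (not how Toil actually executes anything, and the <code>i * 10</code> body is a made-up example, not part of FizzBuzz): each variable declared inside the scatter body is gathered into an array when viewed from outside.

```python
# Rough Python analogue of a WDL scatter. WDL runs the body once per
# array element, in parallel; this sketch runs it serially, which is
# enough to see the gather behavior.
input_array = list(range(4))              # WDL: Array[Int] numbers = range(4)
gathered = [i * 10 for i in input_array]  # hypothetical body: Int x = i * 10
print(gathered)  # [0, 10, 20, 30]
```

Outside the scatter, a body declaration like <code>Int x</code> would be visible as an <code>Array[Int]</code>, just like <code>gathered</code> here.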
We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } ===Conditionals=== Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } ===Calling Tasks=== Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. 
Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, but only if we didn't make a noise instead. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } ===Writing Tasks=== Our task should go after the workflow in the file.
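Before writing the task, the scatter-plus-<code>select_first()</code> logic assembled above can be sanity-checked outside of WDL. Here is a rough Python sketch of the same logic (nothing Toil runs: <code>None</code> stands in for WDL's null, and plain <code>str()</code> stands in for the <code>stringify_number</code> task):

```python
def select_first(values):
    # Sketch of WDL's select_first(): return the first non-null value.
    for v in values:
        if v is not None:
            return v
    raise ValueError("all values are null")

def fizzbuzz(item_count, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    results = []
    for i in range(item_count):      # the scatter
        n = i + 1                    # Int one_based = i + 1
        fizz = "Fizz" if n % to_fizz == 0 else None
        fizzbuzz_word = None
        if n % to_fizz == 0 and n % to_buzz == 0:
            fizzbuzz_word = select_first([fizzbuzz_override, "FizzBuzz"])
        buzz = "Buzz" if n % to_buzz == 0 else None
        # str() plays the role of the stringify_number task call.
        stringified = str(n) if fizz is None and buzz is None else None
        results.append(select_first([fizzbuzz_word, fizz, buzz, stringified]))
    return results

print(fizzbuzz(15))
```

Since <code>select_first()</code> scans left to right, the most specific value (the FizzBuzz override, if given) wins whenever several of the variables are set.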
It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill that in in <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. 
In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script.
# So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } ===Output Block=== Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1 and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script.
# So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. ==Running the Workflow== Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json =Debugging Workflows= Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong. ==Debugging Options== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs are reproduced like this. ==Reading the Log== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. 
nonzero) exit code, which will happen when either the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error. Go up higher in the log until you find lines that look like: [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt: And [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt: These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. ==Reproducing Problems== When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. In the log of your failing Toil task, look for lines like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk.
So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ==More Ways of Finding Files== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name. For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ==Using Development Versions of Toil== Sometimes, bugs will be fixed in the development version of Toil, but not released yet. 
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ==Frequently Asked Questions== ===I am getting warnings about <code>XDG_RUNTIME_DIR</code>=== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it. ===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!=== The Toil worker process for each job will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. 
=Additional WDL resources= For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL documentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification] 8b87f461fa5c65293235e03906afb286e4069755 425 423 2023-10-20T21:21:46Z Anovak 4 Remind people where the data needs to live. wikitext text/x-wiki '''Tutorial: Getting Started with WDL Workflows on Phoenix''' Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong. =Phoenix Cluster Setup= Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH. ==Getting VPN access== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives.
So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ==Connecting to Phoenix== Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should ''not'' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs. To connect to the head node: 1. Connect to the VPN. 2. SSH to <code>phoenix.prism</code>. At the command line, run: ssh phoenix.prism If your username on the cluster (say, <code>flastname</code>) is different than your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@phoenix.prism The first time you connect, you will see a message like: The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established. ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI. This key is not known by any other names Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is.
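If you would rather check the fingerprint before answering the prompt at all, you can fetch the host key and hash it yourself. This is a sketch, not an official procedure; it assumes your machine is already on the VPN so that <code>ssh-keyscan</code> can reach <code>phoenix.prism</code>:

```shell
# Fetch the head node's ed25519 host key and print its SHA256 fingerprint.
# The output should match the fingerprint quoted above.
ssh-keyscan -t ed25519 phoenix.prism 2>/dev/null | ssh-keygen -lf -
```

The same <code>ssh-keygen -lf</code> trick works on any public key file, so you can also use it later to re-check a key already saved in <code>~/.ssh/known_hosts</code>.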
==Installing Toil with WDL support== Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run: pip install --upgrade --user 'toil[wdl]' If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively: pip install --upgrade --user 'toil[wdl,aws,google]' This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter ''will not'' look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run: echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc After that, '''log out and log back in''', to restart bash and pick up the change. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above. ==Configuring Toil for Phoenix== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps.
So, use these commands to make sure that Toil knows where it ought to put its caches: echo 'export SINGULARITY_CACHEDIR="${HOME}/.singularity/cache"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${HOME}/.cache/miniwdl"' >>~/.bashrc After that, '''log out and log back in again''', to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day], so it is important not to skip this step. ==Configuring your Phoenix Environment== '''Do not try and store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under <code>/public/groups</code>. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under <code>/public/groups</code>. Usually you would end up with <code>/public/groups/YOURGROUPNAME/YOURUSERNAME</code>. =Running an existing workflow= First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, go to your user directory under <code>/public/groups</code>, and make a directory to work in. cd /public/groups/YOURGROUPNAME/YOURUSERNAME mkdir workflow-test cd workflow-test Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. 
wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ==Preparing an input file== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code>: echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ==Testing at small scale on a single machine== We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --pty bash -i This will start a new shell; to leave it and go back to the head node you can use <code>exit</code>.
In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ==Running at larger scale== Back on the head node, let's prepare to run a larger run. Greeting 3 people isn't cool, let's greet one hundred people! Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. 
mkdir -p logs toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. =Writing your own workflow= In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ==Writing the file== ===Version=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 ===Workflow Block=== Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } ===Input Block=== Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run. 
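To make the input rules concrete, here is a hypothetical inputs file for this workflow (the file name <code>fizzbuzz_inputs.json</code> is just an example, not one used elsewhere in this tutorial). It sets the required <code>item_count</code> and fills in the optional <code>fizzbuzz_override</code>; <code>to_fizz</code> and <code>to_buzz</code> are omitted, so they keep their defaults of 3 and 5:

```shell
# Write an inputs file: one required input, one optional override.
# Omitted inputs fall back to their defaults (or stay null if optional).
cat >fizzbuzz_inputs.json <<'EOF'
{
  "FizzBuzz.item_count": 20,
  "FizzBuzz.fizzbuzz_override": "FizzBuzz!"
}
EOF
```

If you left out <code>FizzBuzz.item_count</code>, which has no default, the workflow would refuse to run.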
===Body=== Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. ===Scattering=== Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } ===Conditionals=== Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. 
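As a point of comparison, the selection that the conditionals and <code>select_first()</code> together implement is ordinary FizzBuzz. Here is the same decision sketched as a plain shell function (a hypothetical helper for illustration, not part of the workflow), using the default divisors of 3 and 5:

```shell
# The same decision the workflow's conditionals make, for one number,
# with the default to_fizz=3 and to_buzz=5 hardcoded.
fizzbuzz_value() {
  n=$1
  if [ $((n % 3)) -eq 0 ] && [ $((n % 5)) -eq 0 ]; then
    echo "FizzBuzz"
  elif [ $((n % 3)) -eq 0 ]; then
    echo "Fizz"
  elif [ $((n % 5)) -eq 0 ]; then
    echo "Buzz"
  else
    echo "$n"  # Just a normal number.
  fi
}

fizzbuzz_value 15  # prints FizzBuzz
```

The workflow gets the same effect without an <code>else</code> by letting un-taken branches leave their variables <code>null</code> and then picking the first non-null value with <code>select_first()</code>.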
Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } ===Calling Tasks=== Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, but only when we didn't produce a noise instead. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task.
version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } ===Writing Tasks=== Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill that in in <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ???
} } Now we need to capture the result of the command script. The WDL <code>stdout()</code> returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make?
Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } ===Output Block=== Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1 and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. ==Running the Workflow== Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json =Debugging Workflows= Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.
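Before digging into logs, it can help to confirm what the run actually produced. For the FizzBuzz run above, the JSON saved by <code>-m</code> can be inspected with Python's standard JSON pretty-printer (this assumes the run completed and wrote <code>fizzbuzz_out.json</code> as in the earlier command):

```shell
# Pretty-print the outputs JSON written by -m, if the run produced it.
if [ -e fizzbuzz_out.json ]; then
  python3 -m json.tool fizzbuzz_out.json
fi
```

If the file is missing or empty, the workflow most likely failed before reaching its <code>output</code> section, which is a cue to read the log as described below.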
==Debugging Options== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs are reproduced like this. ==Reading the Log== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen when either the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error. Go up higher in the log until you find lines that look like: [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt: And [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt: These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. ==Reproducing Problems== When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. 
In the log of your failing Toil taks, look for likes like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk. So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ==More Ways of Finding Files== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name. 
For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ==Using Development Versions of Toil== Sometimes, bugs will be fixed in the development version of Toil, but not released yet. 
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ==Frequently Asked Questions== ===I am getting warnings about <code>XDG_RUNTIME_DIR</code>=== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it. ===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!=== The Toil worker process for each job will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. 
'''Tutorial: Getting Started with WDL Workflows on Phoenix''' Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like the Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong. =Phoenix Cluster Setup= Before we begin, you will need a computer to work from, on which you are able to install software, and the ability to connect to other machines over SSH.
==Getting VPN access== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ==Connecting to Phoenix== Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should '''not''' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs. To connect to the head node: 1. Connect to the VPN. 2. SSH to <code>phoenix.prism</code>. At the command line, run: ssh phoenix.prism If your username on the cluster (say, <code>flastname</code>) is different from your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@phoenix.prism The first time you connect, you will see a message like: The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established. ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI. This key is not known by any other names Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>.
If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is. ==Installing Toil with WDL support== Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run: pip install --upgrade --user 'toil[wdl]' If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively: pip install --upgrade --user 'toil[wdl,aws,google]' This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter '''will not''' look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run: echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc After that, '''log out and log back in''', to restart bash and pick up the change. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above. ==Configuring your Phoenix Environment== '''Do not try and store data in your home directory on Phoenix!''' The home directories are meant for code and programs.
Any data worth running a workflow on should be in a directory under <code>/public/groups</code>. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under <code>/public/groups</code>. Usually you would end up with <code>/public/groups/YOURGROUPNAME/YOURUSERNAME</code>. Remember this path; we will need it later. ==Configuring Toil for Phoenix== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these files can be large, and the home directory quota is only 30 GB, we can't keep them in your home directory. We will need to use the <code>/public/groups/YOURGROUPNAME/YOURUSERNAME</code> directory you created earlier. Make that directory available in your <code>~/.bashrc</code> file by editing this command to use your actual path and then running it: echo 'BIG_DATA_DIR=/public/groups/YOURGROUPNAME/YOURUSERNAME' >>~/.bashrc Then use these commands to make sure that Toil knows where it ought to put its caches: echo 'export SINGULARITY_CACHEDIR="${BIG_DATA_DIR}/.singularity/cache"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${BIG_DATA_DIR}/.cache/miniwdl"' >>~/.bashrc After that, '''log out and log back in again''', to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day], so it is important not to skip this step.
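Putting those pieces together, the cache configuration appended to your <code>~/.bashrc</code> should end up looking something like the sketch below (<code>YOURGROUPNAME</code> and <code>YOURUSERNAME</code> are placeholders for your real group and user names); the final <code>echo</code> is just a quick way to confirm the variables expand as expected:

```shell
# Placeholder path; substitute your real group and user directory.
BIG_DATA_DIR=/public/groups/YOURGROUPNAME/YOURUSERNAME
# Point the Singularity and MiniWDL image caches at the shared filesystem.
export SINGULARITY_CACHEDIR="${BIG_DATA_DIR}/.singularity/cache"
export MINIWDL__SINGULARITY__IMAGE_CACHE="${BIG_DATA_DIR}/.cache/miniwdl"
# Quick sanity check that the variables expanded as expected.
echo "${SINGULARITY_CACHEDIR}"
```

If the <code>echo</code> prints a path under your <code>/public/groups</code> directory (not under your home directory), the configuration is in effect.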
=Running an existing workflow= First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. To start, go to your user directory under <code>/public/groups</code>, and make a directory to work in. cd /public/groups/YOURGROUPNAME/YOURUSERNAME mkdir workflow-test cd workflow-test Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ==Preparing an input file== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code>: echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file.
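If you want to double-check an inputs file before running anything, any JSON parser will do. For example, this sketch uses Python's built-in <code>json.tool</code> module (assuming <code>python3</code> is on your path, which it will be if you could install Toil with <code>pip</code>):

```shell
# Write the inputs file as above, then confirm it parses as valid JSON.
# json.tool pretty-prints the file on success and errors out on bad JSON.
echo '{"hello_caller.who": "./names.txt"}' >inputs.json
python3 -m json.tool inputs.json
```

A stray comma or mismatched quote in the inputs file is much cheaper to catch here than after submitting a cluster job.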
Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ==Testing at small scale on a single machine== We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --pty bash -i This will start a new shell; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ==Running at larger scale== Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!
Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. mkdir -p logs toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. =Writing your own workflow= In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ==Writing the file== ===Version=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 ===Workflow Block=== Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. 
version 1.0 workflow FizzBuzz { } ===Input Block=== Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run. ===Body=== Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. ===Scattering=== Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. 
Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } ===Conditionals=== Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } ===Calling Tasks=== Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. 
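As an aside, the conditional ''expression'' form mentioned above, which does allow an <code>else</code>, looks like this (<code>label</code> is a hypothetical variable for illustration, not part of our workflow):

```wdl
# A conditional expression always produces a value, so no null handling is needed.
String label = if one_based % to_fizz == 0 then "multiple of fizz" else "something else"
```

We can't use this form for FizzBuzz's three-way decision, though, which is why the workflow uses conditional statements plus <code>select_first()</code> instead.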
To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access (these will be <code>null</code> for the numbers where we made a noise instead of calling the task). Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } ===Writing Tasks=== Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>.
So let's fill that in, with <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.
task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } ===Output Block=== Now the only thing missing is a workflow-level <code>output</code> section.
Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1 and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. ==Running the Workflow== Now all that remains is to run the workflow!
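Before running it, it can help to know what output to expect. The scatter-and-<code>select_first()</code> logic above boils down to ordinary FizzBuzz, which we can sketch sequentially in Bash (an analogy only; the real workflow runs each number as its own parallel task, and builds "FizzBuzz" with <code>select_first()</code> rather than by concatenation):

```shell
# Sequential sketch of the workflow logic, with the defaults to_fizz=3, to_buzz=5.
item_count=15
for i in $(seq 1 "$item_count"); do
    out=""
    if [ $((i % 3)) -eq 0 ]; then out="Fizz"; fi
    if [ $((i % 5)) -eq 0 ]; then out="${out}Buzz"; fi
    # If we made no noise, just print the number itself.
    if [ -z "$out" ]; then out="$i"; fi
    echo "$out"
done
```

With <code>item_count=15</code>, the last line printed is <code>FizzBuzz</code>, matching what the workflow's <code>fizzbuzz_results</code> array should contain.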
As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json =Debugging Workflows= Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong. ==Debugging Options== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs are reproduced like this. ==Reading the Log== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen when either the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error. 
Go up higher in the log until you find lines that look like: [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt: And [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt: These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. ==Reproducing Problems== When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. In the log of your failing Toil taks, look for likes like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk. 
So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ==More Ways of Finding Files== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name. For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ==Using Development Versions of Toil== Sometimes, bugs will be fixed in the development version of Toil, but not released yet. 
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ==Frequently Asked Questions== ===I am getting warnings about <code>XDG_RUNTIME_DIR</code>=== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it. ===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!=== The Toil worker process for each job will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. 
'''Tutorial: Getting Started with WDL Workflows on Phoenix''' Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like the Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong. =Phoenix Cluster Setup= Before we begin, you will need a computer to work at, on which you are able to install software, and the ability to connect to other machines over SSH.
==Getting VPN access== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ==Connecting to Phoenix== Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should ''not'' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs. To connect to the head node: 1. Connect to the VPN. 2. SSH to <code>phoenix.prism</code>. At the command line, run: ssh phoenix.prism If your username on the cluster (say, <code>flastname</code>) is different from your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@phoenix.prism The first time you connect, you will see a message like: The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established. ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI. This key is not known by any other names Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>.
If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is. ==Installing Toil with WDL support== Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run: pip install --upgrade --user 'toil[wdl]' If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively: pip install --upgrade --user 'toil[wdl,aws,google]' This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter ''will not'' look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run: echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc After that, '''log out and log back in''', to restart bash and pick up the change. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above. ==Configuring your Phoenix Environment== '''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs.
Any data worth running a workflow on should be in a directory under <code>/public/groups</code>. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under <code>/public/groups</code>. Usually you would end up with <code>/public/groups/YOURGROUPNAME/YOURUSERNAME</code>. Remember this path; we will need it later. ==Configuring Toil for Phoenix== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these images can be large, and the home directory quota is only 30 GB, we can't keep them in your home directory. We will need to use the <code>/public/groups/YOURGROUPNAME/YOURUSERNAME</code> directory you created earlier. Record that directory in your <code>~/.bashrc</code> file by running this command (first replacing <code>YOURGROUPNAME</code> and <code>YOURUSERNAME</code> with your actual group and user names): echo 'BIG_DATA_DIR=/public/groups/YOURGROUPNAME/YOURUSERNAME' >>~/.bashrc Then use these commands to make sure that Toil knows where it ought to put its caches: echo 'export SINGULARITY_CACHEDIR="${BIG_DATA_DIR}/.singularity/cache"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${BIG_DATA_DIR}/.cache/miniwdl"' >>~/.bashrc After that, '''log out and log back in again''', to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day].
'''If you get errors about mutexes, lock files, or other weird problems with Singularity''', try moving these directories into <code>/data/tmp</code> on the individual nodes, or unsetting them and letting Toil use its defaults (and exhaust our Docker pull limits). [https://github.com/DataBiosphere/toil/issues/4654 It is not clear that <code>/private/groups</code> actually implements the necessary file locking correctly.] =Running an existing workflow= First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. To begin, go to your user directory under <code>/public/groups</code>, and make a directory to work in. cd /public/groups/YOURGROUPNAME/YOURUSERNAME mkdir workflow-test cd workflow-test Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ==Preparing an input file== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code>: echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow.
(You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ==Testing at small scale on a single machine== We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --pty bash -i This will start a new shell; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person.
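The JSON that <code>toil-wdl-runner</code> prints to standard output is machine-readable, so you can also process the results programmatically. A quick sketch in plain Python (the JSON string here is just the output shown above, copied verbatim):

```python
import json

# Output printed by toil-wdl-runner, copied from the run above.
result_json = ('{"hello_caller.message_files": ["local_run/Mridula Resurrecci\\u00f3n.txt", '
               '"local_run/Gershom \\u0160arlota.txt", "local_run/Ritchie Ravi.txt"], '
               '"hello_caller.messages": ["Hello, Mridula Resurrecci\\u00f3n!", '
               '"Hello, Gershom \\u0160arlota!", "Hello, Ritchie Ravi!"]}')

outputs = json.loads(result_json)

# Keys are "workflow name.output name"; \uXXXX escapes decode to real characters.
for path in outputs["hello_caller.message_files"]:
    print(path)
```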
To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ==Running at larger scale== Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people! Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil at a shared directory it can create, where it will store information for the cluster nodes to read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. mkdir -p logs toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. =Writing your own workflow= In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ==Writing the file== ===Version=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0.
So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 ===Workflow Block=== Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } ===Input Block=== Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run. ===Body=== Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. ===Scattering=== Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. 
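To make the analogy concrete, here is the same idea sketched in plain Python (not WDL; <code>item_count</code> is just a stand-in value): a scatter takes an array, runs its body once per element, and gathers each declaration into an array of results, much like a <code>map()</code> or list comprehension does.

```python
item_count = 5

# WDL: Array[Int] numbers = range(item_count)
numbers = list(range(item_count))

# WDL: scatter (i in numbers) { Int one_based = i + 1 }
# Every declaration made inside the scatter body is visible, outside
# the scatter, as an array with one entry per input element.
one_based = [i + 1 for i in numbers]

print(one_based)  # [1, 2, 3, 4, 5]
```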
We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } ===Conditionals=== Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } ===Calling Tasks=== Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. 
Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access; for a given number, the task's outputs will only exist if we didn't make a noise instead. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } ===Writing Tasks=== Our task should go after the workflow in the file.
It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill those in with <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. And we're going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command.
In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script.
# So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } ===Output Block=== Now the only thing missing is a workflow-level <code>output</code> section. Technically, WDL 1.0 isn't supposed to require one, but WDL 1.1 does, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect all the strings that came out of the different iterations of our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script.
# So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as an array. ==Running the Workflow== Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json =Debugging Workflows= Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong. ==Debugging Options== When debugging a workflow, make sure to run it with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place where you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs are reproduced like this. ==Reading the Log== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command line command specified by one of your WDL tasks exited with a failing (i.e.
nonzero) exit code, which will happen either when the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error. Go up higher in the log until you find lines that look like: [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt: And [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt: These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. ==Reproducing Problems== When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. In the log of your failing Toil tasks, look for lines like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk.
So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ==More Ways of Finding Files== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name. For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ==Using Development Versions of Toil== Sometimes, bugs will be fixed in the development version of Toil, but not released yet. 
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ==Frequently Asked Questions== ===I am getting warnings about <code>XDG_RUNTIME_DIR</code>=== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it. ===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!=== The Toil worker process for each job will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. 
=Additional WDL resources=
For more information on writing and running WDL workflows, see:
* [https://docs.openwdl.org/en/stable/ The WDL documentation]
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube]
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]

'''Tutorial: Getting Started with WDL Workflows on Phoenix'''

Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.

This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specific to the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster; the other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.

=Phoenix Cluster Setup=
Before we begin, you will need a computer to work at, on which you are able to install software, and the ability to connect to other machines over SSH.

==Getting VPN access==
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives.
So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.

==Connecting to Phoenix==
Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should '''not''' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs.

To connect to the head node:

1. Connect to the VPN.

2. SSH to <code>phoenix.prism</code>. At the command line, run:

 ssh phoenix.prism

If your username on the cluster (say, <code>flastname</code>) is different from your username on your computer (which might be <code>firstname</code>), you might instead have to run:

 ssh flastname@phoenix.prism

The first time you connect, you will see a message like:

 The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established.
 ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI.
 This key is not known by any other names
 Are you sure you want to continue connecting (yes/no/[fingerprint])?

This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is.
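As an optional convenience (not part of the official setup), you can teach SSH the hostname and your cluster username so that a plain <code>ssh phoenix</code> works. Here <code>flastname</code> is a placeholder for your actual cluster username:

```shell
# Optional: add a Host alias so that plain "ssh phoenix" works.
# "flastname" is a placeholder; use your actual cluster username.
mkdir -p ~/.ssh
cat >>~/.ssh/config <<'EOF'
Host phoenix
    HostName phoenix.prism
    User flastname
EOF
chmod 600 ~/.ssh/config
```

After this, <code>ssh phoenix</code> is equivalent to <code>ssh flastname@phoenix.prism</code>.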
==Installing Toil with WDL support==
Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run:

 pip install --upgrade --user 'toil[wdl]'

If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively:

 pip install --upgrade --user 'toil[wdl,aws,google]'

This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter will '''not''' look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:

 echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc

After that, '''log out and log back in''' to restart bash and pick up the change. To make sure it worked, you can run:

 toil-wdl-runner --help

If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above.

==Configuring your Phoenix Environment==
'''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under <code>/public/groups</code>. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under <code>/public/groups</code>. Usually you would end up with <code>/public/groups/YOURGROUPNAME/YOURUSERNAME</code>.
Remember this path; we will need it later.

==Configuring Toil for Phoenix==
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster we do have a shared filesystem, so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these images can be large, and the home directory quota is only 30 GB, we can't keep them in your home directory; we will need to use the <code>/public/groups/YOURGROUPNAME/YOURUSERNAME</code> directory you created earlier.

Record that directory in your <code>~/.bashrc</code> file by running this command (with the placeholder group and user names replaced by your actual ones):

 echo 'BIG_DATA_DIR=/public/groups/YOURGROUPNAME/YOURUSERNAME' >>~/.bashrc

Then use these commands to make sure that Toil knows where it ought to put its caches:

 echo 'export SINGULARITY_CACHEDIR="${BIG_DATA_DIR}/.singularity/cache"' >>~/.bashrc
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${BIG_DATA_DIR}/.cache/miniwdl"' >>~/.bashrc

After that, '''log out and log back in again''' to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day].

'''If you get errors about mutexes, lock files, or other weird problems with Singularity''', try moving these directories to inside <code>/data/tmp</code> on the individual nodes, or unsetting them and letting Toil use its defaults (and exhaust our Docker pull limits). [https://github.com/DataBiosphere/toil/issues/4654 It is not clear that <code>/private/groups</code> actually implements the necessary file locking correctly.]
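As a recap (not an additional step), putting together the lines appended by this section and by the Toil installation section, the tail of your <code>~/.bashrc</code> should end up equivalent to this, with the placeholder group and user names substituted:

```shell
# Recap of the ~/.bashrc additions made so far.
# YOURGROUPNAME/YOURUSERNAME are placeholders for your actual names.
export PATH="${HOME}/.local/bin:${PATH}"
BIG_DATA_DIR=/public/groups/YOURGROUPNAME/YOURUSERNAME
export SINGULARITY_CACHEDIR="${BIG_DATA_DIR}/.singularity/cache"
export MINIWDL__SINGULARITY__IMAGE_CACHE="${BIG_DATA_DIR}/.cache/miniwdl"
```

Order matters: <code>BIG_DATA_DIR</code> has to be set before the two <code>export</code> lines that expand it.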
=Running an existing workflow=
First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].

Go to your user directory under <code>/public/groups</code>, and make a directory to work in:

 cd /public/groups/YOURGROUPNAME/YOURUSERNAME
 mkdir workflow-test
 cd workflow-test

Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.

 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl

==Preparing an input file==
Near the top of the WDL file, there's a section like this:

 workflow hello_caller {
     input {
         File who
     }

This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.

So first, we have to make that list of names. Let's make it in <code>names.txt</code>:

 echo "Mridula Resurrección" >names.txt
 echo "Gershom Šarlota" >>names.txt
 echo "Ritchie Ravi" >>names.txt

Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this:

 echo '{"hello_caller.who": "./names.txt"}' >inputs.json

Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file.
Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].

==Testing at small scale on a single machine==
We are now ready to run the workflow! You don't want to run workflows on the head node, so use Slurm to get an interactive session on one of the cluster's worker nodes by running:

 srun -c 2 --mem 8G --pty bash -i

This will start a new shell on a worker node. In your new shell, run this Toil command:

 toil-wdl-runner self_test.wdl inputs.json -o local_run

This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print:

 {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]}

The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person.

To leave your interactive Slurm session and return to the head node, use <code>exit</code>.

==Running at larger scale==
Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!
Go get this handy list of people and cut it to length:

 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt
 head -n100 1000_names.txt >100_names.txt

And make a new inputs file:

 echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json

Now we will run the same workflow with the new inputs, this time against the Slurm cluster. To do that, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create, where it will store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it.

 mkdir -p logs
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json

This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory.

=Writing your own workflow=
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz].

==Writing the file==
===Version===
All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement:

 version 1.0

===Workflow Block===
Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>.
 version 1.0
 workflow FizzBuzz {
 }

===Input Block===
Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section.

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
 }

Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.

===Body===
Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable.

 Array[Int] numbers = range(item_count)

WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.

===Scattering===
Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0.
 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
 }

===Conditionals===
Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we put it and a default value into an array, and use the WDL <code>select_first()</code> function to find the first non-null value in that array.

Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else, you will have to check the negated condition.

So first, let's handle the special cases.

 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
     if (one_based % to_fizz == 0) {
         String fizz = "Fizz"
         if (one_based % to_buzz == 0) {
             String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
         }
     }
     if (one_based % to_buzz == 0) {
         String buzz = "Buzz"
     }
     if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
         # Just a normal number.
     }
 }

===Calling Tasks===
Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later.
To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, but only if we didn't make a noise instead.

 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
     if (one_based % to_fizz == 0) {
         String fizz = "Fizz"
         if (one_based % to_buzz == 0) {
             String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
         }
     }
     if (one_based % to_buzz == 0) {
         String buzz = "Buzz"
     }
     if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
         # Just a normal number.
         call stringify_number {
             input:
                 the_number = one_based
         }
     }
     String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
 }

We can put the code into the workflow now, and set about writing the task.

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }

===Writing Tasks===
Our task should go after the workflow in the file. It looks a lot like a workflow, except it uses <code>task</code>.

 task stringify_number {
 }

We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>.
So let's fill that in with <code>input</code> and <code>output</code> sections.

 task stringify_number {
     input {
         Int the_number
     }
     # ???
     output {
         String the_string # = ???
     }
 }

Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code> (that is, Bash-like substitution but with a tilde) to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string # = ???
     }
 }

Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
 }

We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.
 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage.

Then we can put our task into our WDL file:

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }
 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

===Output Block===
Now the only thing missing is a workflow-level <code>output</code> section.
Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task.

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
     output {
         Array[String] fizzbuzz_results = result
     }
 }
 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array.

==Running the Workflow==
Now all that remains is to run the workflow!
As before, make an inputs file to specify the workflow inputs:

 echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json

Then run it on the cluster with Toil:

 mkdir -p logs
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

Or locally:

 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

=Debugging Workflows=
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.

==Debugging Options==
When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code>, so that the files shipped between jobs are stored in a place where you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:

 =========> Toil job log is here <=========

Normally, only the logs of failing jobs are reproduced like this.

==Reading the Log==
When a WDL workflow fails, you are likely to see a message like this:

 WDL.runtime.error.CommandFailed: task command failed with exit status 1
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism

This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen either when the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error.
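As a toy illustration of where that exit status comes from, here is a hypothetical failing command script, run the way a task's command is run: with <code>set -e</code>, the first failing command stops the script, and its exit code becomes the task's exit status.

```shell
# A stand-in for a task's command script: with `set -e`, the first failing
# command ("false" here) ends the script, and its exit code (1) becomes
# the exit status that Toil reports for the task.
bash -c 'set -e; echo "tool output"; false; echo "never printed"' || echo "exit status: $?"
# prints "tool output" followed by "exit status: 1"
```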
Go up higher in the log until you find lines that look like:

 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt:

and

 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt:

These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.

==Reproducing Problems==
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. In the log of your failing Toil task, look for lines like this:

 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'
 ...

The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk.
So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ==More Ways of Finding Files== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name. For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ==Using Development Versions of Toil== Sometimes, bugs will be fixed in the development version of Toil, but not released yet. 
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ==Frequently Asked Questions== ===I am getting warnings about <code>XDG_RUNTIME_DIR</code>=== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it. ===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!=== The Toil worker process for each job will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. 
=Additional WDL resources= For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL documentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification] '''Tutorial: Getting Started with WDL Workflows on Phoenix''' Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like the Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong. =Phoenix Cluster Setup= Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH. ==Getting VPN access== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives.
So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it well in advance of when you need to use the cluster. ==Connecting to Phoenix== Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should '''not''' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands that control cluster jobs. To connect to the head node: 1. Connect to the VPN. 2. SSH to <code>phoenix.prism</code>. At the command line, run: ssh phoenix.prism If your username on the cluster (say, <code>flastname</code>) is different from your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@phoenix.prism The first time you connect, you will see a message like: The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established. ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI. This key is not known by any other names Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide whether it is talking to the genuine <code>phoenix.prism</code>, and not an impostor. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is.
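If your cluster username differs from your local one, you can record it in your SSH configuration so that a plain <code>ssh phoenix.prism</code> works without typing the username. This is an optional convenience, assuming a standard OpenSSH client; <code>flastname</code> is a placeholder for your actual cluster username:

```shell
# Optional: tell SSH which username to use for phoenix.prism.
# flastname is a placeholder; substitute your own cluster username.
mkdir -p ~/.ssh
cat >>~/.ssh/config <<'EOF'
Host phoenix.prism
    User flastname
EOF
```

After this, <code>ssh phoenix.prism</code> will log in as <code>flastname</code> automatically.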
==Installing Toil with WDL support== Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run: pip install --upgrade --user 'toil[wdl]' If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to install Toil with the <code>aws</code> and <code>google</code> extras, respectively: pip install --upgrade --user 'toil[wdl,aws,google]' This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter '''will not''' look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run: echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc After that, '''log out and log back in''' to restart bash and pick up the change. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above. ==Configuring your Phoenix Environment== '''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under <code>/public/groups</code>. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under <code>/public/groups</code>. Usually you would end up with <code>/public/groups/YOURGROUPNAME/YOURUSERNAME</code>.
Remember this path; we will need it later. ==Configuring Toil for Phoenix== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these files can be large, and the home directory quota is only 30 GB, we can't keep them in your home directory. We will need to use the <code>/public/groups/YOURGROUPNAME/YOURUSERNAME</code> directory you created earlier. Record that directory in your <code>~/.bashrc</code> file by editing this command to use your actual path and then running it: echo 'BIG_DATA_DIR=/public/groups/YOURGROUPNAME/YOURUSERNAME' >>~/.bashrc Then use these commands to make sure that Toil knows where it ought to put its caches: echo 'export SINGULARITY_CACHEDIR="${BIG_DATA_DIR}/.singularity/cache"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${BIG_DATA_DIR}/.cache/miniwdl"' >>~/.bashrc After that, '''log out and log back in again''' to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day]. '''If you get errors about mutexes, lock files, or other weird problems with Singularity''', try moving these directories to inside <code>/data/tmp</code> on the individual nodes, or unsetting them and letting Toil use its defaults (and exhaust our Docker pull limits). [https://github.com/DataBiosphere/toil/issues/4654 It is not clear that <code>/private/groups</code> actually implements the necessary file locking correctly.]
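Putting the setup steps together, the Toil-related lines appended to your <code>~/.bashrc</code> should now look something like this (a sketch; <code>YOURGROUPNAME/YOURUSERNAME</code> are placeholders for your actual group and user directory):

```shell
# Lines added to ~/.bashrc by the setup steps above
export PATH="${HOME}/.local/bin:${PATH}"
BIG_DATA_DIR=/public/groups/YOURGROUPNAME/YOURUSERNAME
export SINGULARITY_CACHEDIR="${BIG_DATA_DIR}/.singularity/cache"
export MINIWDL__SINGULARITY__IMAGE_CACHE="${BIG_DATA_DIR}/.cache/miniwdl"
```

If something isn't taking effect after you log back in, comparing your actual <code>~/.bashrc</code> against this sketch is a quick first check.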
=Running an existing workflow= First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. Go to your user directory under <code>/public/groups</code>, and make a directory to work in. cd /public/groups/YOURGROUPNAME/YOURUSERNAME mkdir workflow-test cd workflow-test Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ==Preparing an input file== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code>: echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file.
Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ==Testing at small scale on a single machine== We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --pty bash -i This will start a new shell. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ==Running at larger scale== Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!
Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. mkdir -p logs toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. =Writing your own workflow= In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ==Writing the file== ===Version=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 ===Workflow Block=== Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. 
version 1.0 workflow FizzBuzz { } ===Input Block=== Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run. ===Body=== Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. ===Scattering=== Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. 
Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } ===Conditionals=== Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } ===Calling Tasks=== Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. 
To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access; for iterations where the call did not run, that value is <code>null</code>, which is exactly what <code>select_first()</code> expects. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } ===Writing Tasks=== Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>.
So let's fill that in, in <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.
task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } ===Output Block=== Now the only thing missing is a workflow-level <code>output</code> section.
Technically, in WDL 1.0 you aren't supposed to need this, but it is required in WDL 1.1, and Toil doesn't currently deliver your outputs anywhere if you don't have one, so we're going to write one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. ==Running the Workflow== Now all that remains is to run the workflow!
As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json =Debugging Workflows= Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong. ==Debugging Options== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the files shipped between jobs are stored in a place where you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs are reproduced like this. ==Reading the Log== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen when either the command line command is written wrong, or the error detection code in the tool you are trying to run detects and reports an error.
Go up higher in the log until you find lines that look like: [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt: And [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt: These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. ==Reproducing Problems== When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. In the log of your failing Toil task, look for lines like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk.
So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ==More Ways of Finding Files== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name. For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ==Using Development Versions of Toil== Sometimes, bugs will be fixed in the development version of Toil, but not released yet. 
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ==Frequently Asked Questions== ===I am getting warnings about <code>XDG_RUNTIME_DIR</code>=== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it. ===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!=== The Toil worker process for each job will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. 
=Additional WDL resources= For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL documentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification] bfa519b6e260331e61fb2363f35cedaf3d929b98 448 447 2023-12-01T16:10:49Z Anovak 4 Show using the partitions wikitext text/x-wiki '''Tutorial: Getting Started with WDL Workflows on Phoenix''' Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong. =Phoenix Cluster Setup= Before we begin, you will need a computer to work from, on which you can install software, and the ability to connect to other machines over SSH. ==Getting VPN access== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives.
So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ==Connecting to Phoenix== Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should '''not''' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs. To connect to the head node: 1. Connect to the VPN. 2. SSH to <code>phoenix.prism</code>. At the command line, run: ssh phoenix.prism If your username on the cluster (say, <code>flastname</code>) is different from your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@phoenix.prism The first time you connect, you will see a message like: The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established. ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI. This key is not known by any other names Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is.
==Installing Toil with WDL support== Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run: pip install --upgrade --user 'toil[wdl]' If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively: pip install --upgrade --user 'toil[wdl,aws,google]' This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter '''will not''' look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run: echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc After that, '''log out and log back in''', to restart bash and pick up the change. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above. ==Configuring your Phoenix Environment== '''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under <code>/public/groups</code>. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under <code>/public/groups</code>. Usually you would end up with <code>/public/groups/YOURGROUPNAME/YOURUSERNAME</code>.
Remember this path; we will need it later. ==Configuring Toil for Phoenix== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. However, since these files can be large, and the home directory quota is only 30 GB, we can't keep these in your home directory. We will need to use the <code>/public/groups/YOURGROUPNAME/YOURUSERNAME</code> directory you created earlier. Make sure that that directory is available in your <code>~/.bashrc</code> file by editing and running this command: echo 'BIG_DATA_DIR=/public/groups/YOURGROUPNAME/YOURUSERNAME' >>~/.bashrc Then use these commands to make sure that Toil knows where it ought to put its caches: echo 'export SINGULARITY_CACHEDIR="${BIG_DATA_DIR}/.singularity/cache"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${BIG_DATA_DIR}/.cache/miniwdl"' >>~/.bashrc After that, '''log out and log back in again''', to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day]. '''If you get errors about mutexes, lock files, or other weird problems with Singularity''', try moving these directories to inside <code>/data/tmp</code> on the individual nodes, or unsetting them and letting Toil use its defaults (and exhaust our Docker pull limits). [https://github.com/DataBiosphere/toil/issues/4654 It is not clear that <code>/private/groups</code> actually implements the necessary file locking correctly.] 
=Running an existing workflow= First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, go to your user directory under <code>/public/groups</code>, and make a directory to work in. cd /public/groups/YOURGROUPNAME/YOURUSERNAME mkdir workflow-test cd workflow-test Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ==Preparing an input file== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code>: echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file.
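Because the inputs file is plain JSON, you can also build it programmatically instead of with <code>echo</code>. Here is a minimal Python sketch, using the same key and filename as above:

```python
import json

# Keys are "<workflow name>.<input name>"; file values are paths
# relative to the inputs file's location.
inputs = {"hello_caller.who": "./names.txt"}

with open("inputs.json", "w") as f:
    json.dump(inputs, f)
```

This writes the same <code>inputs.json</code> content as the <code>echo</code> command above.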
Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ==Testing at small scale single-machine== We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem --time=02:00:00 --partition=medium --pty bash -i This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ==Running at larger scale== Back on the head node, let's prepare to run a larger run. Greeting 3 people isn't cool, let's greet one hundred people! 
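As an aside on the <code>\u00f3</code>-style sequences in the printed output JSON above: they are standard JSON string escapes, and any JSON parser turns them back into the original characters. A quick Python illustration using two of the names from the small run:

```python
import json

# Toil printed the output JSON with non-ASCII characters escaped:
raw = '["Hello, Mridula Resurrecci\\u00f3n!", "Hello, Gershom \\u0160arlota!"]'

# Parsing restores the original characters (\u00f3 is "ó", \u0160 is "Š")
messages = json.loads(raw)
print(messages[0])
```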
Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the <code>TOIL_SLURM_ARGS</code> environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in. mkdir -p logs TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium" toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. =Writing your own workflow= In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ==Writing the file== ===Version=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files).
Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 ===Workflow Block=== Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } ===Input Block=== Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run. ===Body=== Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. ===Scattering=== Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. 
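Since scatters are described as working like Python's <code>map()</code>, the analogy can be made concrete in plain Python (this is an illustration, not WDL; <code>select_first()</code> is a WDL standard-library function we will use shortly):

```python
item_count = 20
numbers = range(item_count)            # WDL: Array[Int] numbers = range(item_count)

# The scatter body runs once per element; results are gathered into an array:
one_based = [i + 1 for i in numbers]   # WDL: scatter (i in numbers) { Int one_based = i + 1 }

# WDL's select_first() picks the first non-null entry of an array;
# here None stands in for WDL's null:
def select_first(values):
    return next(v for v in values if v is not None)

print(one_based[0], one_based[-1], select_first([None, "FizzBuzz"]))
```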
The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } ===Conditionals=== Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } ===Calling Tasks=== Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. 
Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, for the iterations where we didn't produce a noise instead. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } ===Writing Tasks=== Our task should go after the workflow in the file.
It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill that in in <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. 
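In Python terms, the <code>stdout()</code>-plus-<code>read_string()</code> pattern above is like capturing a command's output and stripping the trailing newline (an analogy only, not how Toil implements it):

```python
import subprocess

# Run the task's command and capture its standard output, like WDL's stdout()
completed = subprocess.run(["echo", "42"], capture_output=True, text=True)
print(repr(completed.stdout))

# read_string() reads that output back and removes trailing newlines
the_string = completed.stdout.rstrip("\n")
```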
In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, and may request that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script.
# So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } ===Output Block=== Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script.
# So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. ==Running the Workflow== Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium" toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json =Debugging Workflows= Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong. ==Debugging Options== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs are reproduced like this. ==Reading the Log== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. 
nonzero) exit code, which will happen when either the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error. Go up higher in the log until you find lines that look like: [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt: And [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt: These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. ==Reproducing Problems== When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. In the log of your failing Toil task, look for lines like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk.
So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ==More Ways of Finding Files== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name. For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ==Using Development Versions of Toil== Sometimes, bugs will be fixed in the development version of Toil, but not released yet. 
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ==Frequently Asked Questions== ===I am getting warnings about <code>XDG_RUNTIME_DIR</code>=== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it. ===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!=== The Toil worker process for each job will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. 
'''Tutorial: Getting Started with WDL Workflows on Phoenix''' Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like the Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong. =Phoenix Cluster Setup= Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH. ==Getting VPN access== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives.
So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ==Connecting to Phoenix== Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should '''not''' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs. To connect to the head node: 1. Connect to the VPN. 2. SSH to <code>phoenix.prism</code>. At the command line, run: ssh phoenix.prism If your username on the cluster (say, <code>flastname</code>) is different from your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@phoenix.prism The first time you connect, you will see a message like: The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established. ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI. This key is not known by any other names Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is.
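If you prefer to check a host key from the command line, OpenSSH can print the fingerprint of any saved public key. This sketch generates a throwaway key just to show the mechanics; to check the real server you would instead save the output of <code>ssh-keyscan -t ed25519 phoenix.prism</code> to a file and fingerprint that:

```shell
# Generate a throwaway ed25519 key pair, just to have a public key to inspect.
# (For the real check, save the ssh-keyscan output as the .pub file instead.)
ssh-keygen -t ed25519 -f demo_key -N "" -q
# Print the SHA256 fingerprint in the same format the SSH client shows.
ssh-keygen -lf demo_key.pub
```

The second command's output is what you would compare against the fingerprint quoted above.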
==Installing Toil with WDL support== Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run: pip install --upgrade --user 'toil[wdl]' If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively: pip install --upgrade --user 'toil[wdl,aws,google]' This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter '''will not''' look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run: echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc After that, '''log out and log back in''', to restart bash and pick up the change. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above. ==Configuring your Phoenix Environment== '''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under <code>/public/groups</code>. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under <code>/public/groups</code>. Usually you would end up with <code>/public/groups/YOURGROUPNAME/YOURUSERNAME</code>.
Remember this path; we will need it later. ==Configuring Toil for Phoenix== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. But since these files can be large, and the home directory quota is only 30 GB, we can't keep these in your home directory. We will need to use the <code>/public/groups/YOURGROUPNAME/YOURUSERNAME</code> directory you created earlier. Record that directory in your <code>~/.bashrc</code> file by editing this command to use your own group and user names and then running it: echo 'BIG_DATA_DIR=/public/groups/YOURGROUPNAME/YOURUSERNAME' >>~/.bashrc Then use these commands to make sure that Toil knows where it ought to put its caches: echo 'export SINGULARITY_CACHEDIR="${BIG_DATA_DIR}/.singularity/cache"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${BIG_DATA_DIR}/.cache/miniwdl"' >>~/.bashrc After that, '''log out and log back in again''', to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day]. '''If you get errors about mutexes, lock files, or other weird problems with Singularity''', try moving these directories to inside <code>/data/tmp</code> on the individual nodes, or unsetting them and letting Toil use its defaults (and exhaust our Docker pull limits). [https://github.com/DataBiosphere/toil/issues/4654 It is not clear that <code>/private/groups</code> actually implements the necessary file locking correctly.]
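For reference, after following the setup sections above, your <code>~/.bashrc</code> should end up containing lines along these lines. This is a sketch; the group and user names are placeholders you must replace with your own:

```shell
# Sketch of the ~/.bashrc additions from this tutorial; names are placeholders.
export PATH="${HOME}/.local/bin:${PATH}"
BIG_DATA_DIR=/public/groups/YOURGROUPNAME/YOURUSERNAME
export SINGULARITY_CACHEDIR="${BIG_DATA_DIR}/.singularity/cache"
export MINIWDL__SINGULARITY__IMAGE_CACHE="${BIG_DATA_DIR}/.cache/miniwdl"
# After logging back in, these should print the cache paths you configured:
echo "$SINGULARITY_CACHEDIR"
echo "$MINIWDL__SINGULARITY__IMAGE_CACHE"
```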
=Running an existing workflow= First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. To begin, go to your user directory under <code>/public/groups</code>, and make a directory to work in. cd /public/groups/YOURGROUPNAME/YOURUSERNAME mkdir workflow-test cd workflow-test Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ==Preparing an input file== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code>: echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file.
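Since the relative path in the inputs file is resolved from the directory containing the inputs file, the two files need to sit next to each other. Here is the same file setup as above in one runnable block, with a quick check of the layout:

```shell
# Create the names list, one name per line.
printf '%s\n' "Mridula Resurrección" "Gershom Šarlota" "Ritchie Ravi" > names.txt
# Create the inputs file that points at names.txt by relative path.
echo '{"hello_caller.who": "./names.txt"}' > inputs.json
# The "./names.txt" path is resolved from the inputs file's directory,
# so from that directory the names file must exist:
test -f ./names.txt
wc -l < names.txt
```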
Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ==Testing at small scale single-machine== We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ==Running at larger scale== Back on the head node, let's prepare to run a larger run. Greeting 3 people isn't cool, let's greet one hundred people! 
Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the <code>TOIL_SLURM_ARGS</code> environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in. mkdir -p logs TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium" toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. =Writing your own workflow= In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ==Writing the file== ===Version=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files).
Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 ===Workflow Block=== Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } ===Input Block=== Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run. ===Body=== Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. ===Scattering=== Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. 
The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } ===Conditionals=== Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } ===Calling Tasks=== Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. 
Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, but only when we didn't produce one of the noises instead (an uncalled task's outputs are <code>null</code>). Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } ===Writing Tasks=== Our task should go after the workflow in the file.
It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill that in in <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. 
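The trailing-newline behavior of <code>read_string()</code> is the same one you may know from Bash command substitution; a quick illustration in plain Bash, outside of WDL:

```shell
# `echo` writes the number followed by a newline to standard output...
echo 42 > captured.txt
od -c captured.txt | head -n 1   # the byte dump ends with \n
# ...but capturing the output strips the trailing newline,
# just like WDL's read_string() does with the stdout() file:
the_string=$(echo 42)
printf '[%s]' "$the_string"
```

The final <code>printf</code> shows the captured value has no newline left inside it.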
In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script.
# So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } ===Output Block=== Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1 and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script.
# So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. ==Running the Workflow== Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium" toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json =Debugging Workflows= Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong. ==Debugging Options== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs are reproduced like this. ==Reading the Log== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. 
nonzero) exit code, which will happen when either the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error. Go up higher in the log until you find lines that look like: [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt: And [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt: These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. ==Reproducing Problems== When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. In the log of your failing Toil task, look for lines like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk.
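Since the file ID is just a relative path, reconstructing the on-disk location is a single path join. A sketch, where both values are placeholders to replace with your real <code>--jobStore</code> path and the file ID from your own log:

```shell
# Placeholder values; substitute your own job store path and file ID.
JOBSTORE=/path/to/the/jobstore
FILE_ID=files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam
# The stored file lives at the join of the two:
echo "$JOBSTORE/$FILE_ID"
```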
So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ==More Ways of Finding Files== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name. For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ==Using Development Versions of Toil== Sometimes, bugs will be fixed in the development version of Toil, but not released yet. 
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ==Frequently Asked Questions== ===I am getting warnings about <code>XDG_RUNTIME_DIR</code>=== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it. ===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!=== The Toil worker process for each job will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. 
=Additional WDL resources= For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL dcoumentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification] 2eead521fb99851e8c41fde6af151b986b455cc2 452 449 2023-12-07T23:00:48Z Anovak 4 Show the safer export syntax wikitext text/x-wiki '''Tutorial: Getting Started with WDL Workflows on Phoenix''' Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong. =Phoenix Cluster Setup= Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH. ==Getting VPN access== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. 
So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.

==Connecting to Phoenix==

Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should '''not''' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs.

To connect to the head node:

1. Connect to the VPN.

2. SSH to <code>phoenix.prism</code>. At the command line, run:

 ssh phoenix.prism

If your username on the cluster (say, <code>flastname</code>) is different from your username on your computer (which might be <code>firstname</code>), you might instead have to run:

 ssh flastname@phoenix.prism

The first time you connect, you will see a message like:

 The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established.
 ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI.
 This key is not known by any other names
 Are you sure you want to continue connecting (yes/no/[fingerprint])?

This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is.
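As an optional convenience, you can teach your SSH client your cluster username so you don't have to type it every time. This is a sketch assuming the standard OpenSSH client, with <code>flastname</code> standing in for your actual cluster username; add a stanza like this to <code>~/.ssh/config</code> on your own computer:

```
Host phoenix.prism
    User flastname
```

After that, a plain <code>ssh phoenix.prism</code> will log in as <code>flastname</code>.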
==Installing Toil with WDL support==

Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run:

 pip install --upgrade --user 'toil[wdl]'

If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively:

 pip install --upgrade --user 'toil[wdl,aws,google]'

This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter '''will not''' look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:

 echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc

After that, '''log out and log back in''', to restart bash and pick up the change. To make sure it worked, you can run:

 toil-wdl-runner --help

If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above.

==Configuring your Phoenix Environment==

'''Do not try and store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under <code>/public/groups</code>. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under <code>/public/groups</code>. Usually you would end up with <code>/public/groups/YOURGROUPNAME/YOURUSERNAME</code>.
Remember this path; we will need it later.

==Configuring Toil for Phoenix==

Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these files can be large, and the home directory quota is only 30 GB, we can't keep them in your home directory. We will need to use the <code>/public/groups/YOURGROUPNAME/YOURUSERNAME</code> directory you created earlier. Record that directory in your <code>~/.bashrc</code> file by editing (to fill in your actual group and username) and running this command:

 echo 'BIG_DATA_DIR=/public/groups/YOURGROUPNAME/YOURUSERNAME' >>~/.bashrc

Then use these commands to make sure that Toil knows where it ought to put its caches:

 echo 'export SINGULARITY_CACHEDIR="${BIG_DATA_DIR}/.singularity/cache"' >>~/.bashrc
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${BIG_DATA_DIR}/.cache/miniwdl"' >>~/.bashrc

After that, '''log out and log back in again''', to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day].

'''If you get errors about mutexes, lock files, or other weird problems with Singularity''', try moving these directories to inside <code>/data/tmp</code> on the individual nodes, or unsetting them and letting Toil use its defaults (and exhaust our Docker pull limits). [https://github.com/DataBiosphere/toil/issues/4654 It is not clear that <code>/private/groups</code> actually implements the necessary file locking correctly.]
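Putting the setup steps from this section and the previous one together, the Toil-related lines at the end of your <code>~/.bashrc</code> should end up looking roughly like this sketch (the <code>YOURGROUPNAME/YOURUSERNAME</code> path is a placeholder; substitute your real group and user directory):

```shell
# Big-data directory: keep large caches out of the 30 GB home quota.
BIG_DATA_DIR=/public/groups/YOURGROUPNAME/YOURUSERNAME
# Let bash find user-level pip installs like toil-wdl-runner.
export PATH="${HOME}/.local/bin:${PATH}"
# Cache container images on the shared filesystem, not in your home directory.
export SINGULARITY_CACHEDIR="${BIG_DATA_DIR}/.singularity/cache"
export MINIWDL__SINGULARITY__IMAGE_CACHE="${BIG_DATA_DIR}/.cache/miniwdl"
```

You can check the result after logging back in with <code>echo "$SINGULARITY_CACHEDIR"</code>.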
=Running an existing workflow=

First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].

Go to your user directory under <code>/public/groups</code>, and make a directory to work in.

 cd /public/groups/YOURGROUPNAME/YOURUSERNAME
 mkdir workflow-test
 cd workflow-test

Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.

 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl

==Preparing an input file==

Near the top of the WDL file, there's a section like this:

 workflow hello_caller {
     input {
         File who
     }

This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code>:

 echo "Mridula Resurrección" >names.txt
 echo "Gershom Šarlota" >>names.txt
 echo "Ritchie Ravi" >>names.txt

Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this:

 echo '{"hello_caller.who": "./names.txt"}' >inputs.json

Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file.
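To see how this key naming scales to more than one input, here is an inputs file for a hypothetical workflow (not part of this tutorial) named <code>MyFlow</code>, with a <code>String</code> input <code>sample_name</code> and a <code>File</code> input <code>reads</code>; both names are made up for illustration:

```json
{
  "MyFlow.sample_name": "sample1",
  "MyFlow.reads": "./sample1.fastq.gz"
}
```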
Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].

==Testing at small scale on a single machine==

We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:

 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i

This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command:

 toil-wdl-runner self_test.wdl inputs.json -o local_run

This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print:

 {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]}

The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>.

==Running at larger scale==

Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!
Go get this handy list of people and cut it to length:

 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt
 head -n100 1000_names.txt >100_names.txt

And make a new inputs file:

 echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json

Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the <code>TOIL_SLURM_ARGS</code> environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.

 mkdir -p logs
 export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium"
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json

This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory.

=Writing your own workflow=

In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz].

==Writing the file==

===Version===

All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files).
Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement:

 version 1.0

===Workflow Block===

Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>.

 version 1.0
 workflow FizzBuzz {
 }

===Input Block===

Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section.

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
 }

Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.

===Body===

Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable.

 Array[Int] numbers = range(item_count)

WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.

===Scattering===

Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python.
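To make the analogy concrete, here is a small sketch in Python (not WDL; the <code>item_count</code> value is made up for illustration) of what a scatter that computes <code>i + 1</code> over <code>range(item_count)</code> produces:

```python
# Analogy only: a WDL scatter over range(item_count) whose body computes
# one_based = i + 1 yields one result per input element, like mapping
# a function over a list in Python.
item_count = 5  # hypothetical value for illustration
numbers = list(range(item_count))     # like WDL range(item_count): [0, 1, 2, 3, 4]
one_based = [i + 1 for i in numbers]  # like collecting the scatter's one_based values
print(one_based)  # [1, 2, 3, 4, 5]
```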
The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0.

 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
 }

===Conditionals===

Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array.

Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases.

 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
     if (one_based % to_fizz == 0) {
         String fizz = "Fizz"
         if (one_based % to_buzz == 0) {
             String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
         }
     }
     if (one_based % to_buzz == 0) {
         String buzz = "Buzz"
     }
     if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
         # Just a normal number.
     }
 }

===Calling Tasks===

Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands.
Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, provided we didn't produce a noise for that number instead.

 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
     if (one_based % to_fizz == 0) {
         String fizz = "Fizz"
         if (one_based % to_buzz == 0) {
             String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
         }
     }
     if (one_based % to_buzz == 0) {
         String buzz = "Buzz"
     }
     if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
         # Just a normal number.
         call stringify_number {
             input:
                 the_number = one_based
         }
     }
     String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
 }

We can put the code into the workflow now, and set about writing the task.

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }

===Writing Tasks===

Our task should go after the workflow in the file.
It looks a lot like a workflow, except it uses <code>task</code>.

 task stringify_number {
 }

We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill that in in <code>input</code> and <code>output</code> sections.

 task stringify_number {
     input {
         Int the_number
     }
     # ???
     output {
         String the_string # = ???
     }
 }

Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string # = ???
     }
 }

Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
 }

We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command.
In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage.

Then we can put our task into our WDL file:

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }
 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

===Output Block===

Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task.

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
     output {
         Array[String] fizzbuzz_results = result
     }
 }
 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array.

==Running the Workflow==

Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:

 echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json

Then run it on the cluster with Toil:

 mkdir -p logs
 export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium"
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

Or locally:

 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

=Debugging Workflows=

Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.

==Debugging Options==

When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code>, so that the stored files shipped between jobs are in a place where you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:

 =========> Toil job log is here <=========

Normally, only the logs of failing jobs are reproduced like this.

==Reading the Log==

When a WDL workflow fails, you are likely to see a message like this:

 WDL.runtime.error.CommandFailed: task command failed with exit status 1
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism

This means that the command line command specified by one of your WDL tasks exited with a failing (i.e.
nonzero) exit code, which will happen either when the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error. Go up higher in the log until you find lines that look like:

 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt:

And:

 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt:

These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.

==Reproducing Problems==

When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. In the log of your failing Toil task, look for lines like this:

 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'
 ...

The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk.
So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ==More Ways of Finding Files== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name. For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ==Using Development Versions of Toil== Sometimes, bugs will be fixed in the development version of Toil, but not released yet. 
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ==Frequently Asked Questions== ===I am getting warnings about <code>XDG_RUNTIME_DIR</code>=== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it. ===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!=== The Toil worker process for each job will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. 
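If you kept the main Toil log from a <code>--logDebug</code> run, one way to collect the reported worker log locations is a simple grep. This is a sketch: the saved log file name and the sample line below are made up for illustration, so substitute your own run's log.

```shell
# Hypothetical saved log from a run; in practice use your own log file.
cat > toil_run.log <<'EOF'
[2023-07-16T16:23:54-0700] [MainThread] [I] [toil.worker] Redirecting logging to /data/tmp/3f2a/worker_log.txt
EOF

# Pull out every worker log path that Toil reported during the run:
grep -o 'Redirecting logging to [^ ]*' toil_run.log
```

Remember that paths like /data/tmp are node-local, so look for each reported file on the node where that worker actually ran.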
=Additional WDL resources= For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL documentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification] 66b57c9e84db6bc307f897df19f2cce6bf56b7d3 AWS Account List and Numbers 0 22 424 328 2023-10-12T19:52:02Z Weiler 3 wikitext text/x-wiki This is a list of our currently available AWS accounts and their account numbers: ucsc-bd2k : 862902209576 ucsc-toil-dev : 318423852362 ucsc-vg-dev : 781907127277 ucsc-platform-dev : 719818754276 comparative-genomics-dev : 162786355865 nanopore-dev : 270442831226 ucsc-cgp-production : 097093801910 platform-hca-dev : 122796619775 anvil-dev : 608666466534 gi-gateway : 652235167018 pangenomics : 422448306679 braingeneers : 443872533066 ucsctreehouse : 238605363322 ucsc-bisti-dev : 851631505710 ucsc-genome-browser : 784962239183 dockstore-dev : 635220370222 ucsc-spatial : 541180793903 platform-hca-prod : 542754589326 platform-hca-portal : 158963592881 miga-lab : 156518225147 platform-anvil-dev : 289950828509 platform-anvil-prod : 465330168186 platform-anvil-portal : 166384485414 agc-runs : 598929688444 sequencing-center-cold-store : 436140841220 f1f62d636ee4c9c01f60481107bc0c0d0eb91734 Genomics Institute Computing Information 0 6 427 375 2023-10-23T23:05:51Z Weiler 3 /* Slurm at the Genomics Institute */ wikitext text/x-wiki Welcome to the Genomic Institute Computing Information Repository! Browse the below topics for help in the area you are curious about.
== GI Public Computing Environment == *[[How to access the public servers]] == GI Firewalled Computing Environment (PRISM) == *[[Access to the Firewalled Compute Servers]] *[[Firewalled Computing Resources Overview]] *[[Firewalled Environment Storage Overview]] *[[Firewalled User Account and Storage Cost]] ==VPN Access== *[[Requirement for users to get GI VPN access]] == NIH dbGaP Access Requirements == *[[Requirements for dbGaP Access]] == giCloud Openstack == *[[Overview of giCloud in the Genomics Institute]] *[[Quick Start Instructions to Get Rolling with OpenStack]] == Amazon Web Services Information == *[[Overview of Getting and Using an AWS IAM Account]] *[[AWS Account List and Numbers]] *[[AWS Shared Bucket Usage Graphs]] *[[AWS Best Practices]] *[[AWS S3 Lifecycle Management]] == Slurm at the Genomics Institute == *[[Overview of using Slurm]] *[[Cluster Etiquette]] *[[Annotated Slurm Script]] *[[Job Arrays]] *[[GPU Resources]] *[[Quick Reference Guide]] *[[Slurm Tips for vg]] *[[Slurm Tips for Toil]] *[[Using Docker under Slurm]] *[[Phoenix WDL Tutorial]] ==Kubernetes Information== *[[Computational Genomics Kubernetes Installation]] *[[Undiagnosed Disease Project Kubernetes Installation]] == Problems or technical support == If you have any problems with the GI computing environment, please send an email to '''cluster-admin@soe.ucsc.edu''' 83b135b2514a833b3345b4c7e69d3fe45c14754b 435 427 2023-11-14T21:54:50Z Weiler 3 /* Slurm at the Genomics Institute */ wikitext text/x-wiki Welcome to the Genomic Institute Computing Information Repository! Browse the below topics for help in the area you are curious about. 
== GI Public Computing Environment == *[[How to access the public servers]] == GI Firewalled Computing Environment (PRISM) == *[[Access to the Firewalled Compute Servers]] *[[Firewalled Computing Resources Overview]] *[[Firewalled Environment Storage Overview]] *[[Firewalled User Account and Storage Cost]] ==VPN Access== *[[Requirement for users to get GI VPN access]] == NIH dbGaP Access Requirements == *[[Requirements for dbGaP Access]] == giCloud Openstack == *[[Overview of giCloud in the Genomics Institute]] *[[Quick Start Instructions to Get Rolling with OpenStack]] == Amazon Web Services Information == *[[Overview of Getting and Using an AWS IAM Account]] *[[AWS Account List and Numbers]] *[[AWS Shared Bucket Usage Graphs]] *[[AWS Best Practices]] *[[AWS S3 Lifecycle Management]] == Slurm at the Genomics Institute == *[[Overview of using Slurm]] *[[Cluster Etiquette]] *[[Annotated Slurm Script]] *[[Job Arrays]] *[[GPU Resources]] *[[Quick Reference Guide]] *[[Slurm Queues (Partitions) and Resource Management]] *[[Slurm Tips for vg]] *[[Slurm Tips for Toil]] *[[Using Docker under Slurm]] *[[Phoenix WDL Tutorial]] ==Kubernetes Information== *[[Computational Genomics Kubernetes Installation]] *[[Undiagnosed Disease Project Kubernetes Installation]] == Problems or technical support == If you have any problems with the GI computing environment, please send an email to '''cluster-admin@soe.ucsc.edu''' 604f5ab5bf5b0ee3a4c0496982c310362f62c636 442 435 2023-11-27T03:30:04Z Weiler 3 wikitext text/x-wiki Welcome to the Genomic Institute Computing Information Repository! Browse the below topics for help in the area you are curious about. 
== GI Public Computing Environment == *[[How to access the public servers]] == GI Firewalled Computing Environment (PRISM) == *[[Access to the Firewalled Compute Servers]] *[[Firewalled Computing Resources Overview]] *[[Firewalled Environment Storage Overview]] *[[Firewalled User Account and Storage Cost]] ==VPN Access== *[[Requirement for users to get GI VPN access]] == NIH dbGaP Access Requirements == *[[Requirements for dbGaP Access]] == giCloud Openstack == *[[Overview of giCloud in the Genomics Institute]] *[[Quick Start Instructions to Get Rolling with OpenStack]] == Amazon Web Services Information == *[[Overview of Getting and Using an AWS IAM Account]] *[[AWS Account List and Numbers]] *[[AWS Shared Bucket Usage Graphs]] *[[AWS Best Practices]] *[[AWS S3 Lifecycle Management]] == Slurm at the Genomics Institute == *[[Overview of using Slurm]] *[[Cluster Etiquette]] *[[Annotated Slurm Script]] *[[Job Arrays]] *[[GPU Resources]] *[[Quick Reference Guide]] *[[Slurm Queues (Partitions) and Resource Management]] *[[Slurm Tips for vg]] *[[Slurm Tips for Toil]] *[[Using Docker under Slurm]] *[[Phoenix WDL Tutorial]] ==General Docker Information== *[[Running a Container as a non-root User]] == Problems or technical support == If you have any problems with the GI computing environment, please send an email to '''cluster-admin@soe.ucsc.edu''' 15aa55f5826610448296f291bbffc779e00ee310 Cluster Etiquette 0 47 428 2023-10-23T23:42:12Z Weiler 3 Created page with "Begin!" wikitext text/x-wiki Begin! 2230293d480ce70f013543beb0763396009710ab 430 428 2023-10-25T21:32:39Z Weiler 3 wikitext text/x-wiki When running jobs on the cluster, you must be very aware of how those jobs will affect other users. 1: Always test your job by running one first. Just one. Note how much RAM, how many CPU cores and how much time it takes to run. 
Then, when you submit 50 or 100 of those, you can specify limits in your Slurm batch file on how long the job should run and how much RAM it should use. In that case, Slurm can stop jobs that inadvertently go too long or use too many resources. 2: Don't run too many jobs at once if they use a lot of disk I/O. If every job reads in a 100GB file, and you launch 20 of them, you could bring down the file server serving /private/groups. Run only maybe 5 at once in that case. You can limit your concurrent jobs by specifying something like this in your job batch file: #SBATCH --array=[1-279]%10 inputList=$1 input=$(sed -n "$SLURM_ARRAY_TASK_ID"p $inputList) some_command $input 41663ffedfc1d5fe14bc5f0e7534674d9a424a9b 431 430 2023-10-25T21:54:52Z Weiler 3 wikitext text/x-wiki When running jobs on the cluster, you must be very aware of how those jobs will affect other users. 1: Always test your job by running one first. Just one. Note how much RAM, how many CPU cores and how much time it takes to run. Then, when you submit 50 or 100 of those, you can specify limits in your Slurm batch file on how long the job should run and how much RAM it should use. In that case, Slurm can stop jobs that inadvertently go too long or use too many resources. 2: Don't run too many jobs at once if they use a lot of disk I/O. If every job reads in a 100GB file, and you launch 20 of them at the same time, you could bring down the file server serving /private/groups. Run only maybe 5 at once in that case, or introduce a random delay at the start of your jobs.
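A random start-up delay can be sketched like this (the range and the MAX_DELAY knob are arbitrary illustrative choices, not a site convention):

```shell
#!/bin/bash
# Stagger array tasks so they don't all hit the file server at once.
# MAX_DELAY is an illustrative knob; tune it to your batch size.
MAX_DELAY=${MAX_DELAY:-60}
delay=$(( RANDOM % MAX_DELAY ))
echo "sleeping ${delay}s before reading input"
sleep "$delay"
# ... then start the real work, e.g.: some_command "$input"
```

Because each task picks its own delay, the reads against /private/groups are spread over the delay window instead of landing simultaneously.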
You can limit your concurrent jobs by specifying something like this in your job batch file: #SBATCH --array=[1-279]%10 inputList=$1 input=$(sed -n "$SLURM_ARRAY_TASK_ID"p $inputList) some_command $input 829b4395cf67c51771c1380f03023c9e5990de4f How to access the public servers 0 11 429 401 2023-10-24T17:07:59Z Weiler 3 /* Storage */ wikitext text/x-wiki == How to Gain Access to the Public Genomics Institute Compute Servers == If you need access to the Genomics Institute compute servers please complete this request form: https://app.smartsheet.com/b/form/a76dbd90ba0240ab9ea9d39b390586ce. There are two parts to this process. 1. For the user, please fill in ALL required fields and submit. 2. For the Sponsor/PI - you will receive an email from Smartsheet. Please fill in all required fields and submit. We will receive your completed request, create your account, and go over the details via a short Zoom meeting with you. == Account and Storage Cost == Costs for having an active UNIX account and for storage (per TB) are listed in this document, under "Genomics Project Support", specifically "Genomics IT Systems User Support per user" and "Genomics Data Storage per TB": https://planning.ucsc.edu/budget/rates-and-assessments/recharge-rates/docs/2021-22-approved-recharge-rates.pdf == Account Expiration == Your UNIX account will have an expiration date associated with it after creation, as requested by your sponsor. Please take note of this expiration date when your account is created. You will receive notice by email when your account is about to expire. To renew, simply ask the PI that sponsored you (who will be named in the notice) to email '''cluster-admin@soe.ucsc.edu''' requesting that your account be renewed for another year, or any other requested amount of time. If your account expires, the account will be suspended and you will no longer be able to log in or view any data you may have in our systems.
Any automated scripts (owned by you) that run via cron or other mechanisms will cease to function. == Server Types and Management == You can log into our public compute servers via SSH: '''courtyard.gi.ucsc.edu''': 1TB RAM, 64 cores, 672GB local scratch space, Ubuntu 22.04.2 '''park.gi.ucsc.edu''': 256GB RAM, 32 cores, 5TB local scratch space, Ubuntu 22.04.2 These servers are managed by the Genomics Institute Cluster Admin group. If you need software installed on them, please make your request by emailing cluster-admin@soe.ucsc.edu. == Storage == These servers mount two types of storage: home directories and group storage directories. Your home directory will be located at "/public/home/username" and has a 30GB quota. The group storage directories are created per PI, and each group directory has a default 15TB quota (although in some cases the quota is higher). For example, if David Haussler is the PI that you report to directly, then the directory would exist as /public/groups/hausslerlab. Request access to that group directory and you will then be able to write to it. Each of those group directories is shared by the lab it belongs to, so you must be mindful of everyone's data usage and share the 15TB available per group accordingly. On the compute servers you can check your group's current quota usage by using the '/usr/bin/viewquota' command. You can only check the quota of a group you are part of (you would be a member of the UNIX group of the same name). If you wanted to check the quota usage of /public/groups/hausslerlab for example, you would do: $ viewquota hausslerlab Project quota on /export (/dev/mapper/export) Project ID Used Soft Hard Warn/Grace ---------- --------------------------------- hausslerlab 1.8T 15T 16T 00 [------] == Actually Doing Work and Computing == When doing research, running jobs and the like, please be careful of your resource consumption on the server you are on.
Don't run too many threads or cores at once if doing so would overrun the RAM or disk I/O available. If you are not sure of your potential RAM, CPU or disk impact, start small with one or two processes and work your way up from there. Also, before running your jobs, check what else is already happening on the server by using the 'top' command to see who else and what else is running and what kind of resources are already being consumed. If, after starting a process, you realize that the server slows down considerably or becomes unusable, kill your processes and re-evaluate what you need to make things work. These servers are shared resources - be a good neighbor! == Serving Files to the Public via the Web == If you want to set up a web page on courtyard, or serve files over HTTP from there, do this: mkdir /public/home/''your_username''/public_html chmod 755 /public/home/''your_username''/public_html Put data in the public_html directory. The URL will be: http://public.gi.ucsc.edu/''~username''/ == /data/scratch Space on the Servers == Each server will generally have a local /data/scratch filesystem that you can use to store temporary files. '''BE ADVISED''' that /data/scratch is not backed up, and the data there could disappear in the event of a disk failure or anything else. Do not store important data there. If it is important, it should be moved somewhere else very soon after creation. ad050e29409f87b61cd5be5efc9ca7a923e8f67b GPU Resources 0 36 432 327 2023-10-26T13:51:29Z Weiler 3 wikitext text/x-wiki When submitting jobs, you can ask for GPUs in one of two ways. One is: #SBATCH --gres=gpu:1 That will ask for 1 GPU generically on a node with a free GPU. This request is more specific: #SBATCH --gres=gpu:A5500:3 That requests 3 A5500 GPUs '''only'''.
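Either directive goes in a batch script alongside your other resource requests. The following writes a sketch script (the file name and resource values are illustrative, not site requirements):

```shell
# Write a minimal GPU batch script; submit later with: sbatch gpu_job.sh
cat > gpu_job.sh <<'EOF'
#!/bin/bash
#SBATCH --gres=gpu:1      # one GPU of any available type
#SBATCH -c 4              # CPU cores
#SBATCH --mem=16G         # RAM
#SBATCH --time=01:00:00   # walltime limit

# Show which physical GPU(s) Slurm assigned to this job:
echo "Assigned GPUs: ${SLURM_JOB_GPUS}"
nvidia-smi
EOF
```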
We have several GPU types on the cluster which may fit your specific needs: nVidia RTX A5500 : 24GB RAM nVidia A100 : 80GB RAM For the most part, Slurm takes care of making sure that each job only sees and uses the GPUs assigned to it. Within the job, '''CUDA_VISIBLE_DEVICES''' will be set in the environment, but it will always be set to a list of your requested number of GPUs, starting at 0. Slurm re-numbers the GPUs assigned to each job to appear to start at 0, within the job. If you need access to the "real" GPU numbers (to log or to pass along to Docker), they are available in the '''SLURM_JOB_GPUS''' (for '''sbatch''') or '''SLURM_STEP_GPUS''' (for '''srun''') environment variable. ==Running GPU Workloads== To actually use an nVidia GPU, you need to run a program that uses the CUDA API. There are a few ways to obtain such a program. ===Prebuilt CUDA Applications=== The Slurm cluster nodes have the nVidia drivers installed, as well as basic CUDA tools like nvidia-smi. Some projects, such as tensorflow, may ship pre-built binaries that can use CUDA. You should be able to run these binaries directly, if you download them. ===Building CUDA Applications=== The cluster nodes do not have the full CUDA Toolkit. In particular, they do not have the '''nvcc''' CUDA-enabled compiler. If you want to compile applications that use CUDA, you will need to install the development environment yourself for your user. Once you have '''nvcc''' available to your user, building CUDA applications should work. To run them, you will have to submit them as jobs, because the head node does not have a GPU. ===Containerized GPU Workloads=== Instead of directly installing binaries, or installing and using the CUDA Toolkit, it is often easiest to use containers to download a prebuilt GPU workload and all of its libraries and dependencies. There are a few options for running containerized GPU workloads on the cluster.
====Running Containers in Singularity==== You can run containers on the cluster using Singularity, and give them access to the GPUs that Slurm has selected using the '''--nv''' option. For example: singularity pull docker://tensorflow/tensorflow:latest-gpu srun -c 8 --mem 10G --gres=gpu:1 singularity run --nv docker://tensorflow/tensorflow:latest-gpu python -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())' This will produce output showing that the Tensorflow container is indeed able to talk to one GPU: INFO: Using cached SIF image 2023-05-15 11:36:33.110850: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. 2023-05-15 11:36:38.799035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /device:GPU:0 with 22244 MB memory: -> device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6 [name: "/device:CPU:0" device_type: "CPU" memory_limit: 268435456 locality { } incarnation: 8527638019084870106 xla_global_id: -1 , name: "/device:GPU:0" device_type: "GPU" memory_limit: 23324655616 locality { bus_id: 1 links { } } incarnation: 1860154623440434360 physical_device_desc: "device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6" xla_global_id: 416903419 ] Slurm's containment of the Slurm job to the correct set of GPUs is also passed through to the Singularity container; there is no need to specifically direct Singularity to use the right GPUs unless you are doing something unusual. ====Running Containers in Slurm==== Slurm itself also supports a '''--container''' option for jobs, which allows a whole job to be run inside a container. 
If you are able to [https://slurm.schedmd.com/containers.html convert your container to OCI Bundle format], you can pass it directly to Slurm instead of using Singularity from inside the job. However, Docker-compatible image specifiers can't be given to Slurm, only paths to OCI bundles on disk. Stand-alone tools to download a Docker image from Docker Hub in OCI bundle format ('''skopeo''' and '''umoci''') are not yet installed on the cluster. But the method using the '''docker''' command should work. Slurm containers ''should'' have access to their assigned GPUs, but it is not clear if tools like '''nvidia-smi''' are injected into the container, as they would be with Singularity or the nVidia Container Runtime. ====Running Containers in Docker==== You might be used to running containers with Docker, or containerized GPU workloads with the nVidia Container Runtime or Toolkit. Docker is installed on all the nodes and the daemon is running; if the '''docker''' command does not work for you, ask cluster-admin to add you to the right groups. The '''nvidia''' runtime is set up and will automatically be used. While Slurm configures each Slurm job with a cgroup that directs it to the correct GPUs, '''using Docker to run another container escapes Slurm's confinement''', and using '''--gpus=1''' will ''always'' use the ''first'' GPU in the system, whether that GPU is assigned to your job or not. When using Docker, you ''must'' consult the '''SLURM_JOB_GPUS''' (for '''sbatch''') or '''SLURM_STEP_GPUS''' (for '''srun''') environment variable and pass that along to your container. You should also impose limits on all other resources used by your Docker container, so that your whole job stays within the resources allocated by Slurm's scheduler. (TODO: find out how cgroups handles oversubscription between a Docker container and the Slurm container that launched it).
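The quoting involved can be sketched in isolation. Here '''SLURM_STEP_GPUS''' is set by hand purely for illustration; inside a real '''srun''' step, Slurm provides it for you:

```shell
# Fake the variable srun would set; suppose Slurm assigned GPUs 2 and 3.
SLURM_STEP_GPUS=2,3

# Build the option so that the Docker client sees the inner double-quotes:
gpu_arg="--gpus=\"device=${SLURM_STEP_GPUS}\""
echo "$gpu_arg"
# Would then be used as: docker run --rm $gpu_arg <image> nvidia-smi
```

Constraining the container to <code>device=2,3</code> keeps Docker on the GPUs Slurm actually reserved for the job, instead of grabbing GPU 0.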
An example of a working command is: srun -c 1 --mem 4G --gres=gpu:2 bash -c 'docker run --rm --gpus=\"device=$SLURM_STEP_GPUS\" nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi' Note that the double-quotes are included in the argument to '''--gpus''' as seen by the Docker client, and that '''bash''' and single-quotes are used to ensure that '''$SLURM_STEP_GPUS''' is evaluated within the job itself, and not on the head node. 29a5e42ba6c3394828af4d61106b83e7be9c4af0 Slurm Queues (Partitions) and Resource Management 0 48 436 2023-11-14T22:41:11Z Weiler 3 Created page with "Due to heterogeneous workloads and different batch requirements, we have implemented partitions in slurm, which are similar to queues. Each partition has different default and maximum walltime limits (aka "runtime" limits). You will need to select a partition to launch your jobs in based on what kind of jobs they are and how long they are expected to run. {| class="wikitable" |- style="font-weight:bold;" ! Partition Name ! Default Walltime Limit ! Maximum Walltime Li..." wikitext text/x-wiki Due to heterogeneous workloads and different batch requirements, we have implemented partitions in slurm, which are similar to queues. Each partition has different default and maximum walltime limits (aka "runtime" limits). You will need to select a partition to launch your jobs in based on what kind of jobs they are and how long they are expected to run. {| class="wikitable" |- style="font-weight:bold;" ! Partition Name ! Default Walltime Limit ! Maximum Walltime Limit ! style="border-color:inherit;" | Default Partition? ! Job Priority ! 
Maximum Nodes Utilized |- | short | 10 minutes | 1 hour | style="border-color:inherit;" | Yes | Normal | All |- | medium | 1 hour | 12 hours | style="border-color:inherit;" | No | Normal | 15 |- | long | 12 hours | 7 days | style="border-color:inherit;" | No | Normal | 10 |- | high_priority | 10 minutes | 7 days | style="border-color:inherit;" | No | High | All<br /> |- | gpu | 10 minutes | 7 days | No | Normal | 6 |} bd990a9f72bda2a5a9c88b292445fd777f626e1e 437 436 2023-11-14T22:50:52Z Weiler 3 wikitext text/x-wiki Due to heterogeneous workloads and different batch requirements, we have implemented partitions in Slurm, which are similar to queues. Each partition has different default and maximum walltime limits (aka "runtime" limits). You will need to select a partition to launch your jobs in based on what kind of jobs they are and how long they are expected to run. {| class="wikitable" |- style="font-weight:bold;" ! Partition Name ! Default Walltime Limit ! Maximum Walltime Limit ! style="border-color:inherit;" | Default Partition? ! Job Priority ! Maximum Nodes Utilized |- | short | 10 minutes | 1 hour | style="border-color:inherit;" | Yes | Normal | All |- | medium | 1 hour | 12 hours | style="border-color:inherit;" | No | Normal | 15 |- | long | 12 hours | 7 days | style="border-color:inherit;" | No | Normal | 10 |- | high_priority | 10 minutes | 7 days | style="border-color:inherit;" | No | High | All<br /> |- | gpu | 10 minutes | 7 days | No | Normal | 6 |} If you do not specify a partition to run your job in, it will automatically be assigned the "short" partition by default. If you do not specify a walltime value in your job submission script, it will inherit the "Default Walltime Limit" of the partition it is assigned. Therefore, it is a very good idea to specify which partition your job will go in, and you should also specify a walltime limit, otherwise your jobs will inherit the default walltime limit in the chart above.
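In a batch script, each of those choices is one directive. A sketch (the partition and time values are examples; pick yours from the table above):

```shell
# Write a sketch submission script; submit with: sbatch my_job.sh
cat > my_job.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=medium   # explicit partition instead of the "short" default
#SBATCH --time=02:00:00      # explicit walltime instead of the partition default

some_command
EOF
```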
This all means that it is very important to '''TEST''' your jobs before running many of them! Submit one job and note how many resources it takes (RAM, CPU) and how long it takes to run. Then when you submit many of those jobs, you can correctly specify the number of CPU cores your job needs, how much RAM it needs (pad it by about 20% just in case), and how much time it needs to run (pad it by 40% to account for environmental factors like disk IO load and CPU context switching load). You can test your jobs by running one job via '''srun''' and then noting how many resources it consumed while running (after it finishes). '''Example''' seff 769059 '''Output''' Job ID: 769059 Cluster: discovery User/Group: <user-name>/<group-name> State: COMPLETED (exit code 0) Nodes: 1 Cores per node: 16 CPU Utilized: 00:00:01 CPU Efficiency: 0.11% of 00:15:28 core-walltime Job Wall-clock time: 00:00:58 Memory Utilized: 4.79 MB Memory Efficiency: 4.79% of 100.00 MB eb857f948d96621ef297e13afe218851a64c3b41 438 437 2023-11-14T22:55:25Z Weiler 3 wikitext text/x-wiki Due to heterogeneous workloads and different batch requirements, we have implemented partitions in Slurm, which are similar to queues. Each partition has different default and maximum walltime limits (aka "runtime" limits). You will need to select a partition to launch your jobs in based on what kind of jobs they are and how long they are expected to run. {| class="wikitable" |- style="font-weight:bold;" ! Partition Name ! Default Walltime Limit ! Maximum Walltime Limit ! style="border-color:inherit;" | Default Partition? ! Job Priority !
Maximum Nodes Utilized |- | short | 10 minutes | 1 hour | style="border-color:inherit;" | Yes | Normal | All |- | medium | 1 hour | 12 hours | style="border-color:inherit;" | No | Normal | 15 |- | long | 12 hours | 7 days | style="border-color:inherit;" | No | Normal | 10 |- | high_priority | 10 minutes | 7 days | style="border-color:inherit;" | No | High | All<br /> |- | gpu | 10 minutes | 7 days | No | Normal | 6 |} If you do not specify a partition to run your job in, it will automatically be assigned the "short" partition by default. If you do not specify a walltime value in your job submission script, it will inherit the "Default Walltime Limit" of the partition it is assigned. Therefore, it is a very good idea to specify which partition your job will go in, and you should also specify a walltime limit, otherwise your jobs will inherit the default walltime limit in the chart above. This all means that it is very important to '''TEST''' your jobs before running many of them! Submit one job and note how many resources it takes (RAM, CPU) and how long it takes to run. Then when you submit many of those jobs, you can correctly specify the number of CPU cores your job needs, how much RAM it needs (pad it by about 20% just in case), and how much time it needs to run (pad it by 40% to account for environmental factors like disk IO load and CPU context switching load). You can test your jobs by running one job via '''srun''' with fairly high CPU, RAM and walltime limits (just so it isn't killed due to default limits), then noting how many resources it consumed while running (after it finishes).
'''Example''' seff 769059 '''Output''' Job ID: 769059 Cluster: phoenix User/Group: <user-name>/<group-name> State: COMPLETED (exit code 0) Nodes: 1 Cores per node: 16 CPU Utilized: 00:00:01 CPU Efficiency: 0.11% of 00:15:28 core-walltime Job Wall-clock time: 00:00:58 Memory Utilized: 4.79 MB Memory Efficiency: 4.79% of 100.00 MB So if I needed to run 1000 of these jobs, and they were all similar, I would select the "short" partition, 1 CPU core, maybe specify 8MB RAM, and maybe a 90 second walltime limit. Note how I padded the RAM and walltime a bit to account for unexpected variable cluster conditions. 5ae27fc9563e108b3450c1d7a7fbb80aaf1e131f 439 438 2023-11-14T22:58:57Z Weiler 3 wikitext text/x-wiki == Partitions == Due to heterogeneous workloads and different batch requirements, we have implemented partitions in Slurm, which are similar to queues. Each partition has different default and maximum walltime limits (aka "runtime" limits). You will need to select a partition to launch your jobs in based on what kind of jobs they are and how long they are expected to run. {| class="wikitable" |- style="font-weight:bold;" ! Partition Name ! Default Walltime Limit ! Maximum Walltime Limit ! style="border-color:inherit;" | Default Partition? ! Job Priority ! Maximum Nodes Utilized |- | short | 10 minutes | 1 hour | style="border-color:inherit;" | Yes | Normal | All |- | medium | 1 hour | 12 hours | style="border-color:inherit;" | No | Normal | 15 |- | long | 12 hours | 7 days | style="border-color:inherit;" | No | Normal | 10 |- | high_priority | 10 minutes | 7 days | style="border-color:inherit;" | No | High | All<br /> |- | gpu | 10 minutes | 7 days | No | Normal | 6 |} If you do not specify a partition to run your job in, it will automatically be assigned the "short" partition by default. If you do not specify a walltime value in your job submission script, it will inherit the "Default Walltime Limit" of the partition it is assigned.
Therefore, it is a very good idea to specify which partition your job will go in, and you should also specify a walltime limit, otherwise your jobs will inherit the default walltime limit in the chart above. This all means that it is very important to '''TEST''' your jobs before running many of them! Submit one jobs and note how much resources it takes (RAM, CPU) and how long it takes to run. Then when you submit many of those jobs, you can correctly specify the number of CPU cores your job needs, how much RAM it needs (pad it by about 20% just in case), and how much time it needs to run (pad it by 40% to account for environmental variables like disk IO load and CPU context switching load). You can test your jobs by running one job via '''srun''' with fairly high CPU, RAM and walltime limits (just so it isn't killed due to default limits), then noting how much in resources it consumed while running (after it finishes). '''Example''' seff 769059 '''Output''' Job ID: 769059 Cluster: phoenix User/Group: <user-name>/<group-name> State: COMPLETED (exit code 0) Nodes: 1 Cores per node: 16 CPU Utilized: 00:00:01 CPU Efficiency: 0.11% of 00:15:28 core-walltime Job Wall-clock time: 00:00:58 Memory Utilized: 4.79 MB Memory Efficiency: 4.79% of 100.00 MB So if I needed to run like 1000 of these jobs, and they were all similar, I would select the "short" partition, 1 CPU core, maybe specify 8MB RAM, and maybe 90 seconds walltime limit. Note how I padded the RAM and walltime a bit to account for unexpected variable cluster conditions. == '''high_priority''' Partition Notes == The "high_priority" partition is special in that it will have the highest priority of all jobs on the cluster and will push all other jobs aside in an effort to finish jobs in that partition as fast as possible. This is only available for emergency or mission critical batches that need to be completed in an unexpectedly critically fast way. 
Access to this partition is only granted on a per request basis, and is temporary until your batch finishes. Email '''cluster-admin@soe.ucsc.edu''' if you need to access the high_priority queue and make your case why it is necessary. 40cdec882bb5d7f6882fb5c89ce68dd488f1de5f 440 439 2023-11-14T23:25:05Z Weiler 3 wikitext text/x-wiki == Partitions == Due to heterogeneous workloads and different batch requirements, we have implemented partitions in slurm, which are similar to queues. Each partition has different default and maximum walltime limits (aka "runtime" limits). You will need to select a partition to launch your jobs in based on what kind of jobs they are and how long they are expected to run. {| class="wikitable" |- style="font-weight:bold;" ! Partition Name ! Default Walltime Limit ! Maximum Walltime Limit ! style="border-color:inherit;" | Default Partition? ! Job Priority ! Maximum Nodes Utilized |- | short | 10 minutes | 1 hour | style="border-color:inherit;" | Yes | Normal | All |- | medium | 1 hour | 12 hours | style="border-color:inherit;" | No | Normal | 15 |- | long | 12 hours | 7 days | style="border-color:inherit;" | No | Normal | 10 |- | high_priority | 10 minutes | 7 days | style="border-color:inherit;" | No | High | All<br /> |- | gpu | 10 minutes | 7 days | No | Normal | 6 |} If you do not specify a partition to run your job in, it will automatically be assigned the "short" partition by deafult. If you do not specify a walltime value in your job submission script, it will inherit the "Default Walltime Limit" of the partition it is assigned. Therefore, it is a very good idea to specify which partition your job will go in, and you should also specify a walltime limit, otherwise your jobs will inherit the default walltime limit in the chart above. This all means that it is very important to '''TEST''' your jobs before running many of them! Submit one job and note how much resources it takes (RAM, CPU) and how long it takes to run. 
Then when you submit many of those jobs, you can correctly specify the number of CPU cores your job needs, how much RAM it needs (pad it by about 20% just in case), and how much time it needs to run (pad it by 40% to account for environmental variables like disk IO load and CPU context switching load). You can test your jobs by running one job via '''srun''' with fairly high CPU, RAM and walltime limits (just so it isn't killed due to default limits), then noting how much in resources it consumed while running (after it finishes). '''Example''' seff 769059 '''Output''' Job ID: 769059 Cluster: phoenix User/Group: <user-name>/<group-name> State: COMPLETED (exit code 0) Nodes: 1 Cores per node: 16 CPU Utilized: 00:00:01 CPU Efficiency: 0.11% of 00:15:28 core-walltime Job Wall-clock time: 00:00:58 Memory Utilized: 4.79 MB Memory Efficiency: 4.79% of 100.00 MB So if I needed to run like 1000 of these jobs, and they were all similar, I would select the "short" partition, 1 CPU core, maybe specify 8MB RAM, and maybe 90 seconds walltime limit. Note how I padded the RAM and walltime a bit to account for unexpected variable cluster conditions. == '''high_priority''' Partition Notes == The "high_priority" partition is special in that it will have the highest priority of all jobs on the cluster and will push all other jobs aside in an effort to finish jobs in that partition as fast as possible. This is only available for emergency or mission critical batches that need to be completed in an unexpectedly critically fast way. Access to this partition is only granted on a per request basis, and is temporary until your batch finishes. Email '''cluster-admin@soe.ucsc.edu''' if you need to access the high_priority queue and make your case why it is necessary. 
466e9dce7934b6df5586877e4f7af8f6676ea103 446 440 2023-11-29T18:07:38Z Weiler 3 wikitext text/x-wiki == Partitions == Due to heterogeneous workloads and different batch requirements, we have implemented partitions in slurm, which are similar to queues. Each partition has different default and maximum walltime limits (aka "runtime" limits). You will need to select a partition to launch your jobs in based on what kind of jobs they are and how long they are expected to run. {| class="wikitable" |- style="font-weight:bold;" ! Partition Name ! Default Walltime Limit ! Maximum Walltime Limit ! style="border-color:inherit;" | Default Partition? ! Job Priority ! Maximum Nodes Utilized |- | short | 10 minutes | 1 hour | style="border-color:inherit;" | Yes | Normal | All |- | medium | 1 hour | 12 hours | style="border-color:inherit;" | No | Normal | 15 |- | long | 12 hours | 7 days | style="border-color:inherit;" | No | Normal | 10 |- | high_priority | 10 minutes | 7 days | style="border-color:inherit;" | No | High | All<br /> |- | gpu | 10 minutes | 7 days | No | Normal | 6 |} If you do not specify a partition to run your job in, it will automatically be assigned the "short" partition by default. If you do not specify a walltime value in your job submission script, it will inherit the "Default Walltime Limit" of the partition it is assigned. Therefore, it is a very good idea to specify which partition your job will go in, and you should also specify a walltime limit, otherwise your jobs will inherit the default walltime limit in the chart above. This all means that it is very important to '''TEST''' your jobs before running many of them! Submit one job and note how much resources it takes (RAM, CPU) and how long it takes to run. 
Then when you submit many of those jobs, you can correctly specify the number of CPU cores your job needs, how much RAM it needs (pad it by about 20% just in case), and how much time it needs to run (pad it by 40% to account for environmental variables like disk IO load and CPU context switching load). You can test your jobs by running one job via '''srun''' with fairly high CPU, RAM and walltime limits (just so it isn't killed due to default limits), then noting how much in resources it consumed while running (after it finishes). '''Example''' seff 769059 '''Output''' Job ID: 769059 Cluster: phoenix User/Group: <user-name>/<group-name> State: COMPLETED (exit code 0) Nodes: 1 Cores per node: 16 CPU Utilized: 00:00:01 CPU Efficiency: 0.11% of 00:15:28 core-walltime Job Wall-clock time: 00:00:58 Memory Utilized: 4.79 MB Memory Efficiency: 4.79% of 100.00 MB So if I needed to run like 1000 of these jobs, and they were all similar, I would select the "short" partition, 1 CPU core, maybe specify 8MB RAM, and maybe 90 seconds walltime limit. Note how I padded the RAM and walltime a bit to account for unexpected variable cluster conditions. == '''high_priority''' Partition Notes == The "high_priority" partition is special in that it will have the highest priority of all jobs on the cluster and will push all other jobs aside in an effort to finish jobs in that partition as fast as possible. This is only available for emergency or mission critical batches that need to be completed in an unexpectedly critically fast way. Access to this partition is only granted on a per request basis, and is temporary until your batch finishes. Email '''cluster-admin@soe.ucsc.edu''' if you need to access the high_priority queue and make your case why it is necessary. 
250d6cbc10d82c5b9e45d1921dce1c3a8126f0eb Slurm Tips for Toil 0 38 441 371 2023-11-20T21:54:12Z Anovak 4 Show how to install with extras and a branch wikitext text/x-wiki Here are some tips for running Toil workflows on the Phoenix Slurm cluster. Mostly you might want to run WDL workflows, but you can use some of these for other workflows like Cactus. You can also consult [https://github.com/DataBiosphere/toil/blob/master/docs/running/wdl.rst#running-wdl-with-toil the Toil documentation on WDL workflows].

* Install Toil with WDL support with:
 pip3 install --upgrade toil[wdl]
To use a development version of Toil, you can install from source instead:
 pip3 install git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl]
Or for a particular branch:
 pip3 install git+https://github.com/DataBiosphere/toil.git@issues/123-abc#egg=toil[wdl]
* You will then need to make sure your '''~/.local/bin''' directory is on your PATH. Open up your '''~/.bashrc''' file and add:
 export PATH=$PATH:$HOME/.local/bin
Then make sure to log out and back in again.
* For Toil options, you will want '''--batchSystem slurm''' to make it use Slurm and '''--batchLogsDir ./logs''' (or some other location on a shared filesystem) for the Slurm logs to not get lost.
* You may be able to speed up your workflow with '''--caching true''', to cache data on nodes to be shared among multiple simultaneous tasks.
* If using '''toil-wdl-runner''', you might want to add '''--jobStore ./jobStore''' to make sure the job store is in a defined, shared location so that you can use '''--restart''' later.
* If using '''toil-wdl-runner''', you will want to set the '''SINGULARITY_CACHEDIR''' and '''MINIWDL__SINGULARITY__IMAGE_CACHE''' environment variables for your workflow to locations on shared storage, and possibly to the default cache locations in your home directory. Otherwise Toil will set them to node-local directories for each node, and thus re-download images for each workflow run, and for each cluster node.
To avoid this, you could, for example, add the following before your run, or to your '''~/.bashrc''':
 export SINGULARITY_CACHEDIR=$HOME/.singularity/cache
 export MINIWDL__SINGULARITY__IMAGE_CACHE=$HOME/.cache/miniwdl
a349216df5a7dcd00c1b6c5a35c0bc2e0a5f6619 Running a Container as a non-root User 0 49 443 2023-11-27T03:38:05Z Weiler 3 Created page with "Information here pulled from an article by Lucas Wilson-Richter on medium.com: https://medium.com/redbubble/running-a-docker-container-as-a-non-root-user-7d2e00f8ee15 =='''The Problem: Docker writes files as root'''== Sometimes, when we run builds in Docker containers, the build creates files in a folder that’s mounted into the container from the host (e.g. the source code directory). This can cause us pain, because those files will be owned by the root user. When..." wikitext text/x-wiki Information here pulled from an article by Lucas Wilson-Richter on medium.com: https://medium.com/redbubble/running-a-docker-container-as-a-non-root-user-7d2e00f8ee15

=='''The Problem: Docker writes files as root'''==

Sometimes, when we run builds in Docker containers, the build creates files in a folder that’s mounted into the container from the host (e.g. the source code directory). This can cause us pain, because those files will be owned by the root user. When an ordinary user tries to clean those files up when preparing for the next build (for example by using git clean), they get an error and our build fails. There are a few ways we could deal with this problem:
* We could try to prevent the build from creating any files, but that’s very limiting — we lose the ability to generate assets, or write any data to the disk. This is definitely too restrictive to solve the problem in a way that I could use with any build.
* We could tell Git to ignore the affected files, but that carries the risk that they’ll hang around in the file system and have an effect on future builds.
We’ve encountered that problem in the past at Redbubble, so we are wary about letting that happen again.
* We could clean up the files at the end of the build, while we’re still running our Dockerised process. But that would require us to implement lots of error trapping logic to ensure the cleanup happens, but still exit the build with the correct result.
It would be more elegant if we could simply create files in a way that allows ordinary users to delete them. For example, we could tell Docker to run as an ordinary user instead of root.

=='''Time to be someone else'''==

Fortunately, docker run gives us a way to do this: the --user parameter. We're going to use it to specify the user ID (UID) and group ID (GID) that Docker should use. This works because Docker containers all share the same kernel, and therefore the same list of UIDs and GIDs, even if the associated usernames are not known to the containers (more on that later). To run our asset build, we could use a command something like this:
 # Mount the source code, set the working dir, run as the given user,
 # using our build env image, and run the command:
 docker container run --rm -it \
   -v "$(pwd)":/app \
   --workdir /app \
   --user 1000:1000 \
   my-docker/my-build-environment:latest \
   make assets
This will tell Docker to run its processes with user ID 1000 and group ID 1000. That will mean that any files created by that process also belong to the user with ID 1000.

=='''But I just want to be me!'''==

But what if we don’t know the current user’s ID? Is there some way to automatically discover that? There is: id is a program for finding out exactly this information. We can use it with the -u switch to get the UID, and the -g switch to get the GID. So instead of setting --user 1000:1000, we could use subshells to set --user $(id -u):$(id -g). That way, we can always use the current user's UID and GID.
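The subshell trick above can be checked without Docker at all; this little sketch just composes the value that would be passed to --user (the image and command in the comment are hypothetical):

```shell
# Build the UID:GID string the article passes to docker's --user flag
user_spec="$(id -u):$(id -g)"
echo "$user_spec"

# It would then be used like (hypothetical image name):
#   docker container run --rm --user "$user_spec" \
#     my-docker/my-build-environment:latest make assets
```

Whatever account you run this as, the output is always two numeric IDs joined by a colon, which is exactly the form --user expects.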
=='''docker-compose'''==

We often like to run our tests and things using docker-compose, so that we can spin up any required services as needed - databases and so on. So wouldn't it be nice if we could do this with docker-compose as well? Unfortunately, we can’t use subshells in a compose file — it’s not a supported part of the format. Lucky for us, we can insert environment variables. So if we have a docker-compose.yml like this:
 # This is an abbreviated example docker-compose.yml
 version: '3.3'
 services:
   rspec:
     image: my-docker/my-build-environment:latest
     environment:
       - RAILS_ENV=test
     command: ["make", "assets"]
     # THIS BIT!!!1!
     user: ${CURRENT_UID}
     volumes:
       - .:/app
We could use a little bash to set that variable and start docker-compose:
 CURRENT_UID=$(id -u):$(id -g) docker-compose up
Et voila! Our Dockerised script will create files as if it were the host user!

=='''Gotchas'''==

'''Your user will be $HOME-less.'''

What we’re actually doing here is asking our Docker container to do things using the ID of a user it knows nothing about, and that creates some complications. Namely, it means that the user is missing some of the things we’ve learned to simply expect users to have — things like a home directory. This can be troublesome, because it means that all the things that live in $HOME — temporary files, application settings, package caches — now have nowhere to live. The containerised process just has no way to know where to put them. This can impact us when we’re trying to do user-specific things. We found that it caused problems using gem install (though using Bundler is OK), or running code that relies on ENV['HOME']. So it may mean that you need to make some adjustments if you do either of those things.

'''Your user will be nameless, too'''

It also turns out that we can’t easily share usernames between a Docker host and its containers. That’s why we can’t just use docker run --user=$(whoami) — the container doesn't know about your username.
It can only find out about your user by its UID. That means that when you run whoami inside your container, you'll get a result like I have no name!. That's entertaining, but if your code relies on knowing your username, you might get some confusing results.

'''Wrapping Up'''

We now have a way to use docker run and docker-compose to create files, without having to use sudo to clean them up! Happy building!
e3a4c6e373cdab0e2d7380222d55a8d27ffaa30c Overview of using Slurm 0 32 445 370 2023-11-29T17:39:48Z Weiler 3 wikitext text/x-wiki When using Slurm, you will need to log into the Slurm head node (currently phoenix.prism). Once you have ssh'd in there, you can execute Slurm batch or interactive commands. You might also want to consult the [[Quick Reference Guide]].

== Submit a Slurm Batch Job ==

In order to submit a Slurm batch job list, you will need to create a directory that you will have read and write access to on all the nodes (which will often be a shared space). Let's say I have a batch named "experiment-1". I would create that directory in my group's area:
 % mkdir -p /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1
 % cd /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1
Then you will need to create your job submission batch file. It will look something like this. My file is called 'slurm-test.sh':
 % vim slurm-test.sh
Then populate the file as necessary:
 #!/bin/bash
 # Job name:
 #SBATCH --job-name=weiler_test
 #
 # Partition - This is the queue it goes in:
 #SBATCH --partition=short
 #
 # Where to send email (optional)
 #SBATCH --mail-user=weiler@ucsc.edu
 #
 # Number of nodes you need per job:
 #SBATCH --nodes=1
 #
 # Memory needed for the jobs. Try very hard to make this accurate. DEFAULT = 4gb
 #SBATCH --mem=4gb
 #
 # Number of tasks (one for each CPU desired for use case) (example):
 #SBATCH --ntasks=1
 #
 # Processors per task:
 # At least eight times the number of GPUs needed for nVidia RTX A5500
 #SBATCH --cpus-per-task=1
 #
 # Number of GPUs, this can be in the format of "--gres=gpu:[1-8]", or "--gres=gpu:A5500:[1-8]" with the type included (optional)
 #SBATCH --gres=gpu:1
 #
 # Standard output and error log
 #SBATCH --output=serial_test_%j.log
 #
 # Wall clock limit in hrs:min:sec:
 #SBATCH --time=00:00:30
 #
 ## Command(s) to run (example):
 pwd; hostname; date
 echo "Running test script on a single CPU core"
 sleep 5
 echo "Test done!"
 date
Keep the "SBATCH" lines commented; the scheduler will read them anyway. If you don't need a particular option, just don't include it in the file. To submit the batch job:
 % sbatch slurm-test.sh
 Submitted batch job 7
The job(s) will then be scheduled. You can see the state of the queue as such:
 % squeue
  JOBID PARTITION     NAME   USER ST  TIME NODES NODELIST(REASON)
      7     batch weiler_t weiler  R  0:07     1 phoenix-01
The job will output any STDOUT or STDERR in the directory you launched the job from. Other than that, it will do whatever the job does, even if there is no STDOUT.

== Launching Several Jobs at Once ==

You can launch many jobs at once using the $SLURM_ARRAY_TASK_ID variable. Add something like the following to your batch submission file:
 #SBATCH --array=0-31
 #SBATCH --output=array_job_%A_task_%a.out
 #SBATCH --error=array_job_%A_task_%a.err
 
 ## Command(s) to run:
 echo "I am task $SLURM_ARRAY_TASK_ID"

== CGROUPS and Resource Management ==

Our installation of Slurm utilizes Linux CGROUPS, which put a hard resource cap on jobs. If you define that your job will need 4GB of RAM, and it uses 5GB, it will fail with an OOM exception. Likewise with CPU or GPU resources; if your job ends up using more than you specify, it will fail. The same goes for the "--time" option: your job will fail if it takes longer than what you specify there. This is to keep the nodes from crashing from runaway jobs that use more resources than you think they will. So... TEST YOUR JOBS! Find out how many resources a single job needs before you launch 100 of them.

== TEST YOUR JOBS! ==

Let me say that one more time: test your jobs before launching a bunch of them! If a job fails, you don't want it to fail 100 or more times. Testing also gives you a good idea of how much RAM and CPU it will need, so you can better define your batch files. It's critical to get a good idea of how many resources each of your jobs will use and define your job file appropriately.
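To see what the array-job snippet above does, here is a local simulation (on the cluster, Slurm itself sets $SLURM_ARRAY_TASK_ID for each task; the 0-2 range here is just for illustration):

```shell
# Locally simulate the per-task environment of a small --array=0-2 job:
# each task would run the same command with a different task ID.
for SLURM_ARRAY_TASK_ID in 0 1 2; do
  echo "I am task $SLURM_ARRAY_TASK_ID"
done
```

On the cluster, each of these lines would instead be printed by a separate job, with %A/%a in the log filenames replaced by the array job ID and task ID.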
82f2a085eeb65518f90700cc564f1df13365cc6a Slurm Tips for vg 0 37 453 320 2024-01-05T15:11:59Z Anovak 4 wikitext text/x-wiki This page explains how to set up a development environment for [https://github.com/vgteam/vg vg] on the Phoenix cluster.

==Setting Up==

1. After connecting to the VPN, connect to the cluster head node:
 ssh phoenix.prism
This node is relatively small, so you shouldn't run real work on it, but it is the place you need to be to submit Slurm jobs.

2. Make yourself a user directory under '''/private/groups''', which is where large data must be stored. For example, if you are in the Paten lab:
 mkdir /private/groups/patenlab/$USER

3. (Optional) Link it over to your home directory, so it is easy to use storage there to store your repos. The '''/private/groups''' storage may be faster than the home directory storage.
 mkdir -p /private/groups/patenlab/$USER/workspace
 ln -s /private/groups/patenlab/$USER/workspace ~/workspace

4. Make sure you have SSH keys created and add them to Github.
 cat ~/.ssh/id_ed25519.pub || (ssh-keygen -t ed25519 && cat ~/.ssh/id_ed25519.pub)
 # Paste into https://github.com/settings/ssh/new

5. Make a place to put your clone, and clone vg:
 mkdir -p ~/workspace
 cd ~/workspace
 git clone --recursive git@github.com:vgteam/vg.git
 cd vg

6. vg's dependencies should already be installed on the cluster nodes. If any of them seem to be missing, tell cluster-admin@soe.ucsc.edu to install them.

7. Build vg as a Slurm job. This will send the build out to the cluster as a 64-core, 80G memory job, and keep the output logs in your terminal.
 srun -c 64 --mem=80G --time=00:30:00 make -j64
This will leave your vg binary at '''~/workspace/vg/bin/vg'''.

==Misc Tips==

* If you want an interactive session with appreciable resources, you can schedule one with '''srun'''.
For example, to get 16 cores and 120G memory all for you, run:
 srun -c 16 --mem 120G --time=08:00:00 --partition=medium --pty bash -i
* To send out a job without making a script file for it, use '''sbatch --wrap "your command here"'''.
* You can use arguments from SBATCH lines on the command line!
* You can use [https://github.com/CLIP-HPC/SlurmCommander#readme Slurm Commander] to watch the state of the cluster with the '''scom''' command.
a093b972ba428e3dc9f1d57de47a300eb3866c0d Slurm Queues (Partitions) and Resource Management 0 48 454 446 2024-01-05T15:12:56Z Anovak 4 /* Partitions */ wikitext text/x-wiki == Partitions ==

Due to heterogeneous workloads and different batch requirements, we have implemented partitions in Slurm, which are similar to queues. Each partition has different default and maximum walltime limits (aka "runtime" limits). You will need to select a partition to launch your jobs in based on what kind of jobs they are and how long they are expected to run.

{| class="wikitable"
|- style="font-weight:bold;"
! Partition Name
! Default Walltime Limit
! Maximum Walltime Limit
! style="border-color:inherit;" | Default Partition?
! Job Priority
! Maximum Nodes Utilized
|-
| short
| 10 minutes
| 1 hour
| style="border-color:inherit;" | Yes
| Normal
| All
|-
| medium
| 1 hour
| 12 hours
| style="border-color:inherit;" | No
| Normal
| 15
|-
| long
| 12 hours
| 7 days
| style="border-color:inherit;" | No
| Normal
| 10
|-
| high_priority
| 10 minutes
| 7 days
| style="border-color:inherit;" | No
| High
| All
|-
| gpu
| 10 minutes
| 7 days
| No
| Normal
| 6
|}

If you do not specify a partition to run your job in (with e.g. <code>--partition=medium</code>), it will automatically be assigned the "short" partition by default. If you do not specify a walltime value in your job submission script (with e.g. <code>--time=00:30:00</code>), it will inherit the "Default Walltime Limit" of the partition it is assigned. Therefore, it is a very good idea to specify both the partition your job will go in and a walltime limit; otherwise your jobs will inherit the default walltime limit in the chart above.

This all means that it is very important to '''TEST''' your jobs before running many of them! Submit one job and note how many resources it takes (RAM, CPU) and how long it takes to run. Then when you submit many of those jobs, you can correctly specify the number of CPU cores each job needs, how much RAM it needs (pad it by about 20% just in case), and how much time it needs to run (pad it by about 40% to account for environmental factors like disk I/O load and CPU context-switching load). You can test your jobs by running one job via '''srun''' with fairly high CPU, RAM and walltime limits (just so it isn't killed due to default limits), then noting how many resources it consumed once it finishes.

'''Example'''
 seff 769059
'''Output'''
 Job ID: 769059
 Cluster: phoenix
 User/Group: <user-name>/<group-name>
 State: COMPLETED (exit code 0)
 Nodes: 1
 Cores per node: 16
 CPU Utilized: 00:00:01
 CPU Efficiency: 0.11% of 00:15:28 core-walltime
 Job Wall-clock time: 00:00:58
 Memory Utilized: 4.79 MB
 Memory Efficiency: 4.79% of 100.00 MB
So if I needed to run 1000 of these jobs, and they were all similar, I would select the "short" partition, 1 CPU core, maybe specify 8MB RAM, and maybe a 90-second walltime limit. Note how I padded the RAM and walltime a bit to account for unexpected variable cluster conditions.

== '''high_priority''' Partition Notes ==

The "high_priority" partition is special in that it has the highest priority of all jobs on the cluster and will push all other jobs aside in an effort to finish jobs in that partition as fast as possible. It is only available for emergency or mission-critical batches that unexpectedly need to be completed critically fast.
Access to this partition is only granted on a per request basis, and is temporary until your batch finishes. Email '''cluster-admin@soe.ucsc.edu''' if you need to access the high_priority queue and make your case why it is necessary. aac3685ea2a872d14aa402b9ea5e62983a31b570 456 454 2024-01-22T16:44:35Z Anovak 4 Document how the priority system works and how to make the scheduler account for its choices wikitext text/x-wiki == Partitions == Due to heterogeneous workloads and different batch requirements, we have implemented partitions in slurm, which are similar to queues. Each partition has different default and maximum walltime limits (aka "runtime" limits). You will need to select a partition to launch your jobs in based on what kind of jobs they are and how long they are expected to run. {| class="wikitable" |- style="font-weight:bold;" ! Partition Name ! Default Walltime Limit ! Maximum Walltime Limit ! style="border-color:inherit;" | Default Partition? ! Job Priority ! Maximum Nodes Utilized |- | short | 10 minutes | 1 hour | style="border-color:inherit;" | Yes | Normal | All |- | medium | 1 hour | 12 hours | style="border-color:inherit;" | No | Normal | 15 |- | long | 12 hours | 7 days | style="border-color:inherit;" | No | Normal | 10 |- | high_priority | 10 minutes | 7 days | style="border-color:inherit;" | No | High | All<br /> |- | gpu | 10 minutes | 7 days | No | Normal | 6 |} If you do not specify a partition to run your job in (with e.g. <code>--partition=medium</code>), it will automatically be assigned the "short" partition by default. If you do not specify a walltime value in your job submission script (with e.g. <code>--time=00:30:00</code>), it will inherit the "Default Walltime Limit" of the partition it is assigned. Therefore, it is a very good idea to specify which partition your job will go in, and you should also specify a walltime limit, otherwise your jobs will inherit the default walltime limit in the chart above. 
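To make the padding rules of thumb from the testing section above concrete (pad memory by about 20% and walltime by about 40%), here is a small sketch; the measured values are made up for illustration, not taken from a real seff report:

```shell
# Hypothetical measurements from a seff report of one test job:
measured_mb=100      # peak memory used, in MB
measured_secs=600    # wall-clock run time, in seconds

# Pad memory by ~20% and walltime by ~40%, rounding up with integer math,
# to get the values to request when submitting the full batch.
request_mb=$(( (measured_mb * 120 + 99) / 100 ))
request_secs=$(( (measured_secs * 140 + 99) / 100 ))

echo "request ${request_mb}M of memory and ${request_secs}s of walltime"
```

For these example numbers this prints <code>request 120M of memory and 840s of walltime</code>; those padded values are what you would then pass to <code>sbatch</code> or <code>srun</code> via <code>--mem</code> and <code>--time</code>.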
== My job is not running but I want it to be running ==

Even if your job is in the high_priority partition, that doesn't mean the cluster will drop everything and run it immediately. Because we don't have preemption set up, high-priority jobs still have to wait for currently-running jobs to finish, as well as for other high-priority jobs. And since, as noted above, jobs can be allowed to run for up to 7 days each, it is entirely possible for even the highest-priority job in the whole cluster to not start for a whole week.

Here is a [https://docs-research-it.berkeley.edu/services/high-performance-computing/user-guide/running-your-jobs/why-job-not-run/ good resource from Berkeley] about understanding and debugging Slurm job scheduling. Basically, Slurm uses the wall-clock limits of running jobs, and of jobs in the queue, to make a plan to start each job on some node at some time in the future. If jobs finish early, other jobs can start sooner than scheduled, and if there is space around higher-priority jobs, lower-priority jobs can be filled in.

If you want to know when Slurm plans to run your job, and why that is not right now, you can use the <code>--start</code> option of the <code>squeue</code> command:

 $ squeue -j 1719584 --start
   JOBID PARTITION     NAME     USER ST          START_TIME NODES SCHEDNODES NODELIST(REASON)
 1719584     short snakemak flastnam PD 2024-01-22T10:20:00     1 phoenix-00 (Priority)

The <code>START_TIME</code> column is the time by which Slurm is sure it will be able to start your job if no higher-priority jobs come in first, and the <code>NODELIST(REASON)</code> column shows the nodes the job is running on, or, in parentheses, the reason it is not running now. In this case, the job is not running because higher-priority jobs are in the way.
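If you want more detail than <code>squeue --start</code> gives, the standard Slurm inspection commands can help; the job ID here is the one from the example above, and the <code>sprio</code> output depends on how the cluster's priority plugin is configured:

```
# Dump Slurm's full record for the job, including the Reason field,
# requested resources, and time limits.
scontrol show job 1719584

# If the priority plugin exposes it, show how the job's scheduling
# priority breaks down (age, fair-share, partition factors, etc.).
sprio -j 1719584
```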
Slurm Tips for vg

This page explains how to set up a development environment for [https://github.com/vgteam/vg vg] on the Phoenix cluster.

==Setting Up==

1.
After connecting to the VPN, connect to the cluster head node:

 ssh phoenix.prism

This node is relatively small, so you shouldn't run real work on it, but it is the place you need to be to submit Slurm jobs.

2. Make yourself a user directory under '''/private/groups''', which is where large data must be stored. For example, if you are in the Paten lab:

 mkdir /private/groups/patenlab/$USER

3. (Optional) Link it into your home directory, so it is easy to use that storage for your repos. The '''/private/groups''' storage may be faster than the home directory storage.

 mkdir -p /private/groups/patenlab/$USER/workspace
 ln -s /private/groups/patenlab/$USER/workspace ~/workspace

4. Make sure you have SSH keys created, and add them to GitHub:

 cat ~/.ssh/id_ed25519.pub || (ssh-keygen -t ed25519 && cat ~/.ssh/id_ed25519.pub)
 # Paste into https://github.com/settings/ssh/new

5. Make a place to put your clone, and clone vg:

 mkdir -p ~/workspace
 cd ~/workspace
 git clone --recursive git@github.com:vgteam/vg.git
 cd vg

6. vg's dependencies should already be installed on the cluster nodes. If any of them seem to be missing, tell cluster-admin@soe.ucsc.edu to install them.

7. Build vg as a Slurm job. This will send the build out to the cluster as a 64-core, 80G-memory job, and keep the output logs in your terminal:

 srun -c 64 --mem=80G --time=00:30:00 make -j64

This will leave your vg binary at '''~/workspace/vg/bin/vg'''.

==Misc Tips==

* If you want an interactive session with appreciable resources, you can schedule one with '''srun'''. For example, to get 16 cores and 120G of memory all to yourself, run:

 srun -c 16 --mem 120G --time=08:00:00 --partition=medium --pty bash -i

* To send out a job without writing a script file for it, use '''sbatch --wrap "your command here"'''.
* Any option you would put on an '''#SBATCH''' line in a script can also be passed directly on the '''sbatch''' command line.
* You can use [https://github.com/CLIP-HPC/SlurmCommander#readme Slurm Commander] to watch the state of the cluster with the '''scom''' command.
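As a quick illustration of the last two tips (the wrapped command and the option values here are arbitrary examples):

```
# Submit a one-off command without writing a script file:
sbatch --wrap "vg --help"

# Any option that could appear on an #SBATCH line in a script can be
# given on the sbatch command line instead:
sbatch -c 4 --mem=8G --time=00:10:00 --partition=short --wrap "vg --help"
```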
Genomics Institute Computing Information

Welcome to the Genomic Institute Computing Information Repository!
Browse the topics below for help in the area you are curious about.

== GI Public Computing Environment ==
*[[How to access the public servers]]

== GI Firewalled Computing Environment (PRISM) ==
*[[Access to the Firewalled Compute Servers]]
*[[Firewalled Computing Resources Overview]]
*[[Firewalled Environment Storage Overview]]
*[[Firewalled User Account and Storage Cost]]
*[[Grafana Performance Metrics]]
*[[Visual Studio Code (vscode) Configuration Tweaks]]

==VPN Access==
*[[Requirement for users to get GI VPN access]]

== NIH dbGaP Access Requirements ==
*[[Requirements for dbGaP Access]]

== giCloud Openstack ==
*[[Overview of giCloud in the Genomics Institute]]
*[[Quick Start Instructions to Get Rolling with OpenStack]]

== Amazon Web Services Information ==
*[[Overview of Getting and Using an AWS IAM Account]]
*[[AWS Account List and Numbers]]
*[[AWS Shared Bucket Usage Graphs]]
*[[AWS Best Practices]]
*[[AWS S3 Lifecycle Management]]

== Slurm at the Genomics Institute ==
*[[Overview of using Slurm]]
*[[Cluster Etiquette]]
*[[Annotated Slurm Script]]
*[[Job Arrays]]
*[[GPU Resources]]
*[[Quick Reference Guide]]
*[[Slurm Queues (Partitions) and Resource Management]]
*[[Slurm Tips for vg]]
*[[Slurm Tips for Toil]]
*[[Using Docker under Slurm]]
*[[Phoenix WDL Tutorial]]

==General Docker Information==
*[[Running a Container as a non-root User]]

== Problems or technical support ==

If you have any problems with the GI computing environment, please send an email to '''cluster-admin@soe.ucsc.edu'''

Grafana Performance Metrics
We are tracking server and cluster node performance metrics over time via the '''Grafana''' software suite. This is only available in the firewalled/PRISM area.

If you want to see past and present performance metrics of a particular server or phoenix cluster node, make sure you are connected to the VPN, then navigate to this website:

 http://grafana.prism/dashboards

Then log in using the following credentials:

 username: guest
 password: MoreStats4me

Once logged in, click the small button at the top left of the window with the three small horizontal bars in it, and navigate to the "Dashboards" menu item.

[[File:grafana_menu.png|900px]]

From there, you should be able to see the sub-folders of the different dashboards for the different classes of machines.
[[File:grafana_dashboards.png|1200px]]

Phoenix WDL Tutorial

'''Tutorial: Getting Started with WDL Workflows on Phoenix'''

Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like the Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.

This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specific to the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster; the other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.

=Phoenix Cluster Setup=

Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.

==Getting VPN access==

We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives.
So, to get access to the cluster, you need access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.

==Connecting to Phoenix==

Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should '''not''' run actual work on it; it exists only to give you access to the files on the cluster and to the commands that control cluster jobs.

To connect to the head node:

1. Connect to the VPN.

2. SSH to <code>phoenix.prism</code>. At the command line, run:

 ssh phoenix.prism

If your username on the cluster (say, <code>flastname</code>) is different from your username on your computer (which might be <code>firstname</code>), you might instead have to run:

 ssh flastname@phoenix.prism

The first time you connect, you will see a message like:

 The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established.
 ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI.
 This key is not known by any other names
 Are you sure you want to continue connecting (yes/no/[fingerprint])?

This is your computer asking you to help it decide whether it is talking to the genuine <code>phoenix.prism</code> and not an imposter. Make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is.
==Installing Toil with WDL support==

Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, run:

 pip install --upgrade --user 'toil[wdl]'

If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to install Toil with the <code>aws</code> and <code>google</code> extras, respectively:

 pip install --upgrade --user 'toil[wdl,aws,google]'

This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter will '''not''' look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:

 echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc

After that, '''log out and log back in''' to restart bash and pick up the change. To make sure it worked, you can run:

 toil-wdl-runner --help

If everything worked correctly, it will print a long list of the option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above.

==Configuring your Phoenix Environment==

'''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under <code>/private/groups</code>. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under <code>/private/groups</code>. Usually you would end up with <code>/private/groups/YOURGROUPNAME/YOURUSERNAME</code>.
Remember this path; we will need it later.

==Configuring Toil for Phoenix==

Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster we do have a shared filesystem, so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these files can be large, and the home directory quota is only 30 GB, we can't keep them in your home directory; we will need to use the <code>/private/groups/YOURGROUPNAME/YOURUSERNAME</code> directory you created earlier.

Record that directory in your <code>~/.bashrc</code> file by editing and running this command:

 echo 'BIG_DATA_DIR=/private/groups/YOURGROUPNAME/YOURUSERNAME' >>~/.bashrc

Then use these commands to make sure that Toil knows where it ought to put its caches:

 echo 'export SINGULARITY_CACHEDIR="${BIG_DATA_DIR}/.singularity/cache"' >>~/.bashrc
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${BIG_DATA_DIR}/.cache/miniwdl"' >>~/.bashrc

After that, '''log out and log back in again''' to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day].

'''If you get errors about mutexes, lock files, or other weird problems with Singularity''', try moving these directories to inside <code>/data/tmp</code> on the individual nodes, or unsetting them and letting Toil use its defaults (and exhaust our Docker pull limits). [https://github.com/DataBiosphere/toil/issues/4654 It is not clear that <code>/private/groups</code> actually implements the necessary file locking correctly.]
=Running an existing workflow=

First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].

First, go to your user directory under <code>/private/groups</code>, and make a directory to work in:

 cd /private/groups/YOURGROUPNAME/YOURUSERNAME
 mkdir workflow-test
 cd workflow-test

Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.

 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl

==Preparing an input file==

Near the top of the WDL file, there's a section like this:

 workflow hello_caller {
     input {
         File who
     }

This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code>:

 echo "Mridula Resurrección" >names.txt
 echo "Gershom Šarlota" >>names.txt
 echo "Ritchie Ravi" >>names.txt

Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this:

 echo '{"hello_caller.who": "./names.txt"}' >inputs.json

Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file.
Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].

==Testing at small scale on a single machine==

We are now ready to run the workflow! You don't want to run workflows on the head node, so use Slurm to get an interactive session on one of the cluster's worker nodes by running:

 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i

This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use <code>exit</code>.

In your new shell, run this Toil command:

 toil-wdl-runner self_test.wdl inputs.json -o local_run

This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. The command will print a lot of logging to standard error, and to standard output it will print:

 {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]}

The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person.

To leave your interactive Slurm session and return to the head node, use <code>exit</code>.

==Running at larger scale==

Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!
Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the <code>--caching false</code> option. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the <code>TOIL_SLURM_ARGS</code> environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in. mkdir -p logs export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium" toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. =Writing your own workflow= In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ==Writing the file== ===Version=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files).
Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 ===Workflow Block=== Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } ===Input Block=== Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run. ===Body=== Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. ===Scattering=== Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. 
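To make the <code>map()</code> analogy concrete, here is roughly what a scatter over <code>range(item_count)</code> does, sketched in Python (an illustration only; it is not WDL semantics, and real scatters run their iterations in parallel):

```python
item_count = 5
numbers = range(item_count)  # like the WDL range() call above

# Each scatter iteration declares its own one_based value; viewed from
# outside the scatter, those declarations are gathered into an array.
one_based = [i + 1 for i in numbers]

print(one_based)  # [1, 2, 3, 4, 5]
```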
The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } ===Conditionals=== Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } ===Calling Tasks=== Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. 
Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, using its result only when we didn't produce one of the special strings instead. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } ===Writing Tasks=== Our task should go after the workflow in the file.
It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill those in, in <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We'll also tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command.
In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script.
# So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } ===Output Block=== Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1 and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script.
# So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. ==Running the Workflow== Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium" toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json =Debugging Workflows= Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong. ==Debugging Options== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs are reproduced like this. ==Reading the Log== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. 
nonzero) exit code, which will happen when either the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error. Go up higher in the log until you find lines that look like: [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt: And [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt: These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. ==Reproducing Problems== When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. In the log of your failing Toil task, look for lines like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk.
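Joining the job store path and the file ID can be done by hand, or with a couple of lines of Python (a sketch; the job store path here is a placeholder):

```python
import os

job_store = "/path/to/jobstore"  # whatever you passed to --jobStore
file_id = ("files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/"
           "file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam")

# The file ID is a path relative to the job store directory.
print(os.path.join(job_store, file_id))
```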
So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ==More Ways of Finding Files== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name. For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ==Using Development Versions of Toil== Sometimes, bugs will be fixed in the development version of Toil, but not released yet. 
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ==Frequently Asked Questions== ===I am getting warnings about <code>XDG_RUNTIME_DIR</code>=== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it. ===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!=== The Toil worker process for each job will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. 
=Additional WDL resources= For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL documentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification] '''Tutorial: Getting Started with WDL Workflows on Phoenix''' Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong. =Phoenix Cluster Setup= Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH. ==Getting VPN access== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives.
So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ==Connecting to Phoenix== Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should ''not'' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs. To connect to the head node: 1. Connect to the VPN. 2. SSH to <code>phoenix.prism</code>. At the command line, run: ssh phoenix.prism If your username on the cluster (say, <code>flastname</code>) is different from your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@phoenix.prism The first time you connect, you will see a message like: The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established. ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI. This key is not known by any other names Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is.
==Installing Toil with WDL support== Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run: pip install --upgrade --user 'toil[wdl]' If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively: pip install --upgrade --user 'toil[wdl,aws,google]' This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter ''will not'' look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run: echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc After that, '''log out and log back in''', to restart bash and pick up the change. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above. ==Configuring your Phoenix Environment== '''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under <code>/private/groups</code>. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under <code>/private/groups</code>. Usually you would end up with <code>/private/groups/YOURGROUPNAME/YOURUSERNAME</code>.
Remember this path; we will need it later. ==Configuring Toil for Phoenix== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these files can be large, and the home directory quota is only 30 GB, we can't keep them in your home directory. We will need to use the <code>/private/groups/YOURGROUPNAME/YOURUSERNAME</code> directory you created earlier. Record that directory in your <code>~/.bashrc</code> file by editing this command to use your own group and user names and then running it: echo 'BIG_DATA_DIR=/private/groups/YOURGROUPNAME/YOURUSERNAME' >>~/.bashrc Then use these commands to make sure that Toil knows where it ought to put its caches: echo 'export SINGULARITY_CACHEDIR="${BIG_DATA_DIR}/.singularity/cache"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${BIG_DATA_DIR}/.cache/miniwdl"' >>~/.bashrc After that, '''log out and log back in again''', to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day]. '''If you get errors about mutexes, lock files, or other weird problems with Singularity''', try moving these directories to inside <code>/data/tmp</code> on the individual nodes, or unsetting them and letting Toil use its defaults (and exhaust our Docker pull limits). [https://github.com/DataBiosphere/toil/issues/4654 It is not clear that <code>/private/groups</code> actually implements the necessary file locking correctly.]
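The two cache locations are just your big-data directory plus fixed suffixes; in Python terms, the expansion the shell performs looks like this (an illustration, using a hypothetical example value for <code>BIG_DATA_DIR</code>):

```python
import os

# Hypothetical example value, standing in for what ~/.bashrc sets.
os.environ["BIG_DATA_DIR"] = "/private/groups/YOURGROUPNAME/YOURUSERNAME"

# expandvars() substitutes ${...} references the same way bash does.
cache = os.path.expandvars("${BIG_DATA_DIR}/.singularity/cache")
print(cache)  # /private/groups/YOURGROUPNAME/YOURUSERNAME/.singularity/cache
```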
=Running an existing workflow= First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, go to your user directory under <code>/private/groups</code>, and make a directory to work in. cd /private/groups/YOURGROUPNAME/YOURUSERNAME mkdir workflow-test cd workflow-test Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ==Preparing an input file== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code> echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an *inputs file*, which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. 
Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ==Testing at small scale single-machine== We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ==Running at larger scale== Back on the head node, let's prepare to run a larger run. Greeting 3 people isn't cool, let's greet one hundred people! 
Go get this handy list of people and cut it to length:

 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt
 head -n100 1000_names.txt >100_names.txt

And make a new inputs file:

 echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json

Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.

To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the <code>--caching false</code> option. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the <code>TOIL_SLURM_ARGS</code> environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.

 mkdir -p logs
 export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium"
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json

This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory.

=Writing your own workflow=

In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow.
We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz].

==Writing the file==

===Version===

All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement:

 version 1.0

===Workflow Block===

Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>.

 version 1.0
 workflow FizzBuzz {
 }

===Input Block===

Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section.

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
 }

Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.

===Body===

Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable.
 Array[Int] numbers = range(item_count)

WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.

===Scattering===

Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0.

 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
 }

===Conditionals===

Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array.

Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>.

Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition.

So first, let's handle the special cases.
 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
     if (one_based % to_fizz == 0) {
         String fizz = "Fizz"
         if (one_based % to_buzz == 0) {
             String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
         }
     }
     if (one_based % to_buzz == 0) {
         String buzz = "Buzz"
     }
     if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
         # Just a normal number.
     }
 }

===Calling Tasks===

Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later.

To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, provided the task actually ran and we didn't make a noise instead.

 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
     if (one_based % to_fizz == 0) {
         String fizz = "Fizz"
         if (one_based % to_buzz == 0) {
             String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
         }
     }
     if (one_based % to_buzz == 0) {
         String buzz = "Buzz"
     }
     if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
         # Just a normal number.
         call stringify_number {
             input:
                 the_number = one_based
         }
     }
     String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
 }

We can put the code into the workflow now, and set about writing the task.

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }

===Writing Tasks===

Our task should go after the workflow in the file. It looks a lot like a workflow, except it uses <code>task</code>.

 task stringify_number {
 }

We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill that in in <code>input</code> and <code>output</code> sections.

 task stringify_number {
     input {
         Int the_number
     }
     # ???
     output {
         String the_string # = ???
     }
 }

Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code> (that is, Bash-like substitution but with a tilde) to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string # = ???
     }
 }

Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command.
We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
 }

We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage.

Then we can put our task into our WDL file:

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }
 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

===Output Block===

Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task.

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
     output {
         Array[String] fizzbuzz_results = result
     }
 }
 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array.

==Running the Workflow==

Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:

 echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json

Then run it on the cluster with Toil:

 mkdir -p logs
 export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium"
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

Or locally:

 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

=Debugging Workflows=

Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.
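One cheap first check is to look at the output JSON that a run saved with <code>-m</code>: if it exists and parses, the workflow at least finished. For the FizzBuzz run above, a sketch (this assumes the <code>fizzbuzz_out.json</code> name used there, and that the saved JSON keys outputs as workflow name dot output name, like <code>FizzBuzz.fizzbuzz_results</code>):

```shell
# Count the results in the saved output JSON from the FizzBuzz run.
# Assumes outputs are keyed as "FizzBuzz.fizzbuzz_results" in fizzbuzz_out.json.
python3 -c 'import json; print(len(json.load(open("fizzbuzz_out.json"))["FizzBuzz.fizzbuzz_results"]))'
```

With <code>item_count</code> set to 20 as above, this should print <code>20</code>; anything else means the scatter didn't produce what you expected.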
==Debugging Options==

When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code>, so that the stored files shipped between jobs are in a place where you can access them.

When debug logging is on, the log from every Toil job is inserted in the main Toil log between markers like these:

 =========>
 Toil job log is here
 <=========

Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this.

==Reading the Log==

When a WDL workflow fails, you are likely to see a message like this:

 WDL.runtime.error.CommandFailed: task command failed with exit status 1
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism

This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen either when the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error.

Go up higher in the log until you find lines that look like:

 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt:

And:

 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt:

These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.

==Reproducing Problems==

When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem.
In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. In the log of your failing Toil task, look for lines like this:

 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'
 ...

The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk. So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at:

 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam

==More Ways of Finding Files==

Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name.
For example, if you want to look at <code>Sample.bam</code>, you can look for it like this:

 find /path/to/the/jobstore -name "Sample.bam"

If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:

 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam

You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/ an online URL decoder], getting this:

 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam

Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found.

==Using Development Versions of Toil==

Sometimes, bugs will be fixed in the development version of Toil, but not released yet.
To try the current development version of Toil, you can install it like this:

 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'

If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do:

 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'

==Frequently Asked Questions==

===I am getting warnings about <code>XDG_RUNTIME_DIR</code>===

You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it.

===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!===

The Toil worker process for each job will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.
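As an aside on the "More Ways of Finding Files" section above: you don't need a website to URL-decode a <code>toilfile:</code> URI; Python's standard <code>urllib.parse.unquote()</code> does it locally (shown here with the example URI from that section):

```shell
# URL-decode a toilfile: URI to recover the job-store-relative path.
python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.argv[1]))' \
    'toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam'
```

This prints the decoded URI, from which you can take the part after the last colon as the path under your job store, as described above.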
=Additional WDL resources=

For more information on writing and running WDL workflows, see:

* [https://docs.openwdl.org/en/stable/ The WDL documentation]
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube]
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]

'''Tutorial: Getting Started with WDL Workflows on Phoenix'''

Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.

This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.

=Phoenix Cluster Setup=

Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.
==Getting VPN access==

We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.

To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.

==Connecting to Phoenix==

Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should '''not''' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs.

To connect to the head node:

1. Connect to the VPN.

2. SSH to <code>phoenix.prism</code>. At the command line, run:

 ssh phoenix.prism

If your username on the cluster (say, <code>flastname</code>) is different than your username on your computer (which might be <code>firstname</code>), you might instead have to run:

 ssh flastname@phoenix.prism

The first time you connect, you will see a message like:

 The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established.
 ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI.
 This key is not known by any other names
 Are you sure you want to continue connecting (yes/no/[fingerprint])?

This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>.
If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is.

==Installing Toil with WDL support==

Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run:

 pip install --upgrade --user 'toil[wdl]'

If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively:

 pip install --upgrade --user 'toil[wdl,aws,google]'

This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter ''will not'' look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:

 echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc

After that, '''log out and log back in''', to restart bash and pick up the change. To make sure it worked, you can run:

 toil-wdl-runner --help

If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above.

==Configuring your Phoenix Environment==

'''Do not try and store data in your home directory on Phoenix!''' The home directories are meant for code and programs.
Any data worth running a workflow on should be in a directory under <code>/private/groups</code>. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under <code>/private/groups</code>. Usually you would end up with <code>/private/groups/YOURGROUPNAME/YOURUSERNAME</code>. Remember this path; we will need it later.

==Configuring Toil for Phoenix==

Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. However, since these files can be large, and the home directory quota is only 30 GB, we can't keep these in your home directory. We will need to use the <code>/private/groups/YOURGROUPNAME/YOURUSERNAME</code> directory you created earlier.

Make that directory available in your <code>~/.bashrc</code> file by editing this command to use your own path and then running it:

 echo 'BIG_DATA_DIR=/private/groups/YOURGROUPNAME/YOURUSERNAME' >>~/.bashrc

Then use these commands to make sure that Toil knows where it ought to put its caches:

 echo 'export SINGULARITY_CACHEDIR="${BIG_DATA_DIR}/.singularity/cache"' >>~/.bashrc
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${BIG_DATA_DIR}/.cache/miniwdl"' >>~/.bashrc

After that, '''log out and log back in again''', to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day].
'''If you get errors about mutexes, lock files, or other weird problems with Singularity''', try moving these directories to inside <code>/data/tmp</code> on the individual nodes, or unsetting them and letting Toil use its defaults (and exhaust our Docker pull limits). [https://github.com/DataBiosphere/toil/issues/4654 It is not clear that <code>/private/groups</code> actually implements the necessary file locking correctly.] =Running an existing workflow= First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, go to your user directory under <code>/private/groups</code>, and make a directory to work in. cd /private/groups/YOURGROUPNAME/YOURUSERNAME mkdir workflow-test cd workflow-test Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ==Preparing an input file== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code> echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an *inputs file*, which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. 
(You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ==Testing at small scale single-machine== We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. 
To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ==Running at larger scale== Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people! Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the <code>--caching false</code> option. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the <code>TOIL_SLURM_ARGS</code> environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in. mkdir -p logs export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium" toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory.
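The preparation steps above can also be scripted end to end. A sketch that substitutes a generated name list for the downloaded one (so it runs anywhere) and uses <code>python3</code>'s <code>json.dumps</code> to write the inputs file, which avoids any hand-quoting mistakes:

```shell
# Generate a stand-in name list; the tutorial downloads a real one with wget.
for i in $(seq 1 1000); do echo "Person $i"; done > 1000_names.txt
# Keep only the first hundred names.
head -n100 1000_names.txt > 100_names.txt
# Write the inputs file; json.dumps handles quoting and escaping for us.
python3 -c 'import json; print(json.dumps({"hello_caller.who": "./100_names.txt"}))' > inputs_big.json
cat inputs_big.json
```

Writing the JSON through a real JSON serializer matters once filenames contain spaces, quotes, or non-ASCII characters.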
=Writing your own workflow= In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ==Writing the file== ===Version=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 ===Workflow Block=== Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } ===Input Block=== Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run. ===Body=== Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. 
Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. ===Scattering=== Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } ===Conditionals=== Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. 
Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } ===Calling Tasks=== Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access; like other declarations inside un-executed conditionals, those outputs are <code>null</code> when we make a noise instead. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } ===Writing Tasks=== Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill those in with <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> returns a WDL <code>File</code> containing the standard output printed by the task's command.
We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We'll also tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } ===Output Block=== Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1 and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. ==Running the Workflow== Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium" toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json =Debugging Workflows= Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.
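One cheap first step is to know what output you expect. The FizzBuzz rules above are easy to replicate outside of WDL; a Bash sketch of the same logic, hardcoding the workflow's default <code>to_fizz = 3</code> and <code>to_buzz = 5</code>, that you can compare against the workflow's results:

```shell
item_count=20
# Reproduce the workflow's select_first precedence: FizzBuzz, Fizz, Buzz, number.
for i in $(seq 1 "$item_count"); do
  if   [ $(( i % 3 == 0 && i % 5 == 0 )) -eq 1 ]; then echo "FizzBuzz"
  elif [ $(( i % 3 )) -eq 0 ]; then echo "Fizz"
  elif [ $(( i % 5 )) -eq 0 ]; then echo "Buzz"
  else echo "$i"
  fi
done > fizzbuzz_expected.txt
cat fizzbuzz_expected.txt
```

If the workflow's <code>fizzbuzz_results</code> array disagrees with this, you know which side to suspect.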
==Debugging Options== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this. ==Reading the Log== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen when either the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error. Go up higher in the log until you find lines that look like: [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows: And [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows: These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. ==Reproducing Problems== When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. 
In the log of your failing Toil task, look for lines like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk. So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ==More Ways of Finding Files== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name.
For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ==Using Development Versions of Toil== Sometimes, bugs will be fixed in the development version of Toil, but not released yet. 
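The decode-and-split recipe just described can be done with shell built-ins instead of a web tool. A sketch, assuming <code>python3</code> is available and using the <code>toilfile:</code> URI from the log excerpt above:

```shell
uri='toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam'
# URL-decode the URI (the same thing the web decoder does).
decoded=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.argv[1]))' "$uri")
echo "$decoded"
# The job-store-relative path is everything after the last colon.
path="${decoded##*:}"
echo "$path"
```

Prepend your <code>--jobStore</code> directory to <code>$path</code> to get the file's location on disk.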
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ==Frequently Asked Questions== ===I am getting warnings about <code>XDG_RUNTIME_DIR</code>=== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it. ===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!=== The Toil worker process for each job will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. 
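Once the worker logs are in the main Toil log (via <code>--logDebug</code>), you can slice a saved copy of that log apart with <code>sed</code>. A sketch on a simplified stand-in log file (assumption: the marker lines follow the format quoted in the Debugging Options section; check your own log for the exact text):

```shell
# Stand-in for a saved main Toil log; a real one is much longer and noisier.
cat > toil.log <<'EOF'
[2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] Some unrelated logging
=========>
Traceback from the failing job goes here
<=========
[2024-01-16T20:12:20-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] More unrelated logging
EOF
# Print only the sections between the job log markers.
sed -n '/=========>/,/<=========/p' toil.log
```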
=Additional WDL resources= For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL documentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification] '''Tutorial: Getting Started with WDL Workflows on Phoenix''' Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong. =Phoenix Cluster Setup= Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH. ==Getting VPN access== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives.
So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ==Connecting to Phoenix== Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should ''not'' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs. To connect to the head node: 1. Connect to the VPN. 2. SSH to <code>phoenix.prism</code>. At the command line, run: ssh phoenix.prism If your username on the cluster (say, <code>flastname</code>) is different from your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@phoenix.prism The first time you connect, you will see a message like: The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established. ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI. This key is not known by any other names Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is.
==Installing Toil with WDL support== Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run: pip install --upgrade --user 'toil[wdl]' If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively: pip install --upgrade --user 'toil[wdl,aws,google]' This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter ''will not'' look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run: echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc After that, '''log out and log back in''', to restart bash and pick up the change. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above. ==Configuring your Phoenix Environment== '''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under <code>/private/groups</code>. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under <code>/private/groups</code>. Usually you would end up with <code>/private/groups/YOURGROUPNAME/YOURUSERNAME</code>.
Remember this path; we will need it later. ==Configuring Toil for Phoenix== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these files can be large, and the home directory quota is only 30 GB, we can't keep them in your home directory. We will need to use the <code>/private/groups/YOURGROUPNAME/YOURUSERNAME</code> directory you created earlier. Make sure that that directory is available in your <code>~/.bashrc</code> file by editing this command to use your own group and user names, and then running it: echo 'BIG_DATA_DIR=/private/groups/YOURGROUPNAME/YOURUSERNAME' >>~/.bashrc Then use these commands to make sure that Toil knows where it ought to put its caches: echo 'export SINGULARITY_CACHEDIR="${BIG_DATA_DIR}/.singularity/cache"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${BIG_DATA_DIR}/.cache/miniwdl"' >>~/.bashrc After that, '''log out and log back in again''', to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day].
=Running an existing workflow= First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, go to your user directory under <code>/private/groups</code>, and make a directory to work in. cd /private/groups/YOURGROUPNAME/YOURUSERNAME mkdir workflow-test cd workflow-test Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ==Preparing an input file== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code> echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an *inputs file*, which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. 
Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ==Testing at small scale single-machine== We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ==Running at larger scale== Back on the head node, let's prepare to run a larger run. Greeting 3 people isn't cool, let's greet one hundred people! 
Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the <code>--caching false</code> option. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the <code>TOIL_SLURM_ARGS</code> environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in. mkdir -p logs export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium" toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. =Writing your own workflow= In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. 
We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ==Writing the file== ===Version=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 ===Workflow Block=== Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } ===Input Block=== Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run. ===Body=== Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. 
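WDL's <code>range()</code> behaves like Python's: it produces the integers from 0 up to, but not including, its argument. As a rough Python analogue (plain Python, not WDL) of what we're about to write:

```python
item_count = 5

# WDL: Array[Int] numbers = range(item_count)
numbers = list(range(item_count))
print(numbers)  # [0, 1, 2, 3, 4] -- starts at 0, stops before item_count
```

Keep the zero-based start in mind; it is why the workflow will add 1 to each number later.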
Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. ===Scattering=== Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } ===Conditionals=== Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. 
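As a rough Python analogue of the special-case logic we are about to write (hypothetical helper names, not WDL; <code>None</code> stands in for WDL's <code>null</code>):

```python
def select_first(values):
    # Like WDL's select_first(): return the first value that is not null
    for value in values:
        if value is not None:
            return value
    raise ValueError("all values were null")

def fizzbuzz_word(one_based, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    # Variables declared in conditionals that don't run stay null (None)
    fizz = "Fizz" if one_based % to_fizz == 0 else None
    buzz = "Buzz" if one_based % to_buzz == 0 else None
    fizzbuzz = None
    if fizz is not None and buzz is not None:
        fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
    number = str(one_based) if fizz is None and buzz is None else None
    return select_first([fizzbuzz, fizz, buzz, number])

print([fizzbuzz_word(n) for n in [3, 5, 15, 7]])  # ['Fizz', 'Buzz', 'FizzBuzz', '7']
```

The WDL below follows the same shape: one variable per branch, with <code>select_first()</code> picking whichever branch actually ran.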
Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } ===Calling Tasks=== Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, but only in the case where we didn't produce a noise instead. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } ===Writing Tasks=== Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill those in with <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> returns a WDL <code>File</code> containing the standard output printed by the task's command.
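What the runner does with the <code>command</code> section can be sketched roughly like this (a simplification, assuming <code>bash</code> is available; the real machinery runs the script inside the task's container):

```python
import subprocess

the_number = 42

# WDL's ~{...} substitution splices the value into the Bash script text
script = f"""
set -e
echo {the_number}
"""

# The runner executes the script and captures standard output;
# stdout() exposes that captured output to the task as a File.
result = subprocess.run(["bash", "-c", script],
                        capture_output=True, text=True, check=True)
the_string = result.stdout.rstrip("\n")  # strip the newline that echo adds
print(repr(the_string))
```
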
We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're going to tell it to run in a Docker container, too, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } ===Output Block=== Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. ==Running the Workflow== Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium" toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json =Debugging Workflows= Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.
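Before debugging anything, it helps to know what a successful run should produce. For <code>item_count</code> set to 20, the <code>fizzbuzz_results</code> output in <code>fizzbuzz_out.json</code> should contain these 20 strings, computed here in plain Python as a sanity check:

```python
to_fizz, to_buzz = 3, 5

expected = []
for n in range(1, 21):  # the one_based values 1..20
    if n % to_fizz == 0 and n % to_buzz == 0:
        expected.append("FizzBuzz")
    elif n % to_fizz == 0:
        expected.append("Fizz")
    elif n % to_buzz == 0:
        expected.append("Buzz")
    else:
        expected.append(str(n))

print(expected)
```
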
==Restarting the Workflow== If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with <code>--jobStore</code> set manually to a directory that persists between attempts, you can add <code>--restart</code> to your workflow command and make Toil try again. It will pick up from where it left off and rerun any failed tasks and then the rest of the workflow. This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart. If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques. ==Debugging Options== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are somewhere you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this. ==Reading the Log== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen when either the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error.
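If you are triaging many failures, the exit status can be pulled out of such a line mechanically. A small sketch (the log line is quoted from above; the regular expression is ours, not part of Toil):

```python
import re

line = "WDL.runtime.error.CommandFailed: task command failed with exit status 1"

match = re.search(r"exit status (\d+)", line)
exit_status = int(match.group(1)) if match else None
print(exit_status)
```
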
Go up higher in the log until you find lines that look like: [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows: And [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows: These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. ==Reproducing Problems== When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. In the log of your failing Toil task, look for lines like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk.
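Resolving a file ID against the job store is just path concatenation; a quick sketch, using the file ID from the log excerpt above (your own <code>--jobStore</code> path will differ):

```python
import os

# An example job store path; substitute your own --jobStore value.
job_store = "/private/groups/patenlab/anovak/jobstore"

# The Toil file ID from the "Downloaded file" log line above:
file_id = ("files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/"
           "file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam")

on_disk = os.path.join(job_store, file_id)
print(on_disk)
```
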
So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ==More Ways of Finding Files== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name. For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ==Using Development Versions of Toil== Sometimes, bugs will be fixed in the development version of Toil, but not released yet. 
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ==Frequently Asked Questions== ===I am getting warnings about <code>XDG_RUNTIME_DIR</code>=== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it. ===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!=== The Toil worker process for each job will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. 
=Additional WDL resources= For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL documentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification] '''Tutorial: Getting Started with WDL Workflows on Phoenix''' Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong. =Phoenix Cluster Setup= Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH. ==Getting VPN access== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives.
So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ==Connecting to Phoenix== Once you have VPN access, you can connect to the "head node" of the Phoenix cluster. This node is where everyone logs in, but you should '''not''' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs. To connect to the head node: 1. Connect to the VPN. 2. SSH to <code>phoenix.prism</code>. At the command line, run: ssh phoenix.prism If your username on the cluster (say, <code>flastname</code>) is different from your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@phoenix.prism The first time you connect, you will see a message like: The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established. ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI. This key is not known by any other names Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>phoenix.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is.
==Installing Toil with WDL support== Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run: pip install --upgrade --user 'toil[wdl]' If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively: pip install --upgrade --user 'toil[wdl,aws,google]' This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter '''will not''' look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run: echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc After that, '''log out and log back in''', to restart bash and pick up the change. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above.
Remember this path; we will need it later. ==Configuring Toil for Phoenix== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. However, since these files can be large, and the home directory quota is only 30 GB, we can't keep these in your home directory. We will need to use the <code>/private/groups/YOURGROUPNAME/YOURUSERNAME</code> directory you created earlier. Record that directory in your <code>~/.bashrc</code> file by editing this command to use your real group and user names, then running it: echo 'BIG_DATA_DIR=/private/groups/YOURGROUPNAME/YOURUSERNAME' >>~/.bashrc Then use these commands to make sure that Toil knows where it ought to put its caches: echo 'export SINGULARITY_CACHEDIR="${BIG_DATA_DIR}/.singularity/cache"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${BIG_DATA_DIR}/.cache/miniwdl"' >>~/.bashrc After that, '''log out and log back in again''', to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day]. '''If you get errors about mutexes, lock files, or other weird problems with Singularity''', try moving these directories to inside <code>/data/tmp</code> on the individual nodes, or unsetting them and letting Toil use its defaults (and exhaust our Docker pull limits). [https://github.com/DataBiosphere/toil/issues/4654 It is not clear that <code>/private/groups</code> actually implements the necessary file locking correctly.]
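After logging back in, you can double-check that the cache variables made it into your environment. An optional sketch (the variable names are the ones set by the commands above):

```python
import os

def missing_cache_vars(env):
    """Return the names of the expected cache variables that are not set."""
    expected = ("SINGULARITY_CACHEDIR", "MINIWDL__SINGULARITY__IMAGE_CACHE")
    return [name for name in expected if name not in env]

print(missing_cache_vars(os.environ) or "all cache variables are set")
```

If either name is reported missing, re-check your <code>~/.bashrc</code> edits and log in again.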
=Running an existing workflow= First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, go to your user directory under <code>/private/groups</code>, and make a directory to work in. cd /private/groups/YOURGROUPNAME/YOURUSERNAME mkdir workflow-test cd workflow-test Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ==Preparing an input file== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code> echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. 
Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ==Testing at small scale single-machine== We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ==Running at larger scale== Back on the head node, let's prepare to run a larger run. Greeting 3 people isn't cool, let's greet one hundred people! 
Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the <code>--caching false</code> option. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the <code>TOIL_SLURM_ARGS</code> environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in. mkdir -p logs export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium" toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. =Writing your own workflow= In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. 
We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ==Writing the file== ===Version=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 ===Workflow Block=== Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } ===Input Block=== Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run. ===Body=== Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. 
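For reference, WDL's <code>range()</code> behaves like Python's <code>range()</code>. A quick sketch of the values we'll get, using a hypothetical <code>item_count</code> of 5:

```python
item_count = 5  # hypothetical value, for illustration only

# WDL's range(item_count) yields [0, 1, ..., item_count - 1],
# the same values as Python's range():
numbers = list(range(item_count))
assert numbers == [0, 1, 2, 3, 4]
```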
Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. ===Scattering=== Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } ===Conditionals=== Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. 
Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } ===Calling Tasks=== Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, as long as we didn't make a noise instead. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } ===Writing Tasks=== Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill those in with <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command.
We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We'll also tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } ===Output Block=== Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. ==Running the Workflow== Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium" toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json =Debugging Workflows= Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.
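One cheap check is to compare a workflow's output against a plain-Python model of the same logic. Here is such a model for the FizzBuzz workflow above; it is only a sketch, not anything Toil runs: WDL's <code>select_first()</code> is modeled as "first non-<code>None</code> value", variables from un-executed conditionals as <code>None</code>, and the <code>stringify_number</code> task as Python's <code>str()</code>.

```python
def select_first(values):
    # Model of WDL's select_first(): return the first non-null value.
    for v in values:
        if v is not None:
            return v
    raise ValueError("all values were null")

def classify(one_based, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    # Variables declared inside un-executed WDL conditionals are null;
    # here each starts as None and is set only when its branch runs.
    fizz = buzz = fizzbuzz = plain = None
    if one_based % to_fizz == 0:
        fizz = "Fizz"
        if one_based % to_buzz == 0:
            fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
    if one_based % to_buzz == 0:
        buzz = "Buzz"
    if one_based % to_fizz != 0 and one_based % to_buzz != 0:
        plain = str(one_based)  # stands in for the stringify_number task
    return select_first([fizzbuzz, fizz, buzz, plain])

# The expected fizzbuzz_results array for the first 15 numbers:
assert [classify(n) for n in range(1, 16)] == [
    "1", "2", "Fizz", "4", "Buzz", "Fizz", "7", "8", "Fizz", "Buzz",
    "11", "Fizz", "13", "14", "FizzBuzz"]
```

If the workflow's <code>fizzbuzz_results</code> output disagrees with this model, the bug is in the WDL, not in your inputs.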
==Restarting the Workflow== If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with <code>--jobStore</code> set manually to a directory that persists between attempts, you can add <code>--restart</code> to your workflow command and make Toil try again. It will pick up from where it left off and rerun any failed tasks and then the rest of the workflow. This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart. If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques. ==Debugging Options== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place where you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this. ==Reading the Log== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen when either the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error.
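The "exit status 1" in that error is just the task command's return code. A minimal Python sketch of the same kind of detection (the failing command here is made up for illustration):

```python
import subprocess

# Run a command that prints to stderr and exits nonzero, the way a
# broken task command would, and capture its status and output.
result = subprocess.run(
    ["sh", "-c", "echo 'something went wrong' >&2; exit 1"],
    capture_output=True, text=True)

assert result.returncode == 1                   # the failing exit status
assert "something went wrong" in result.stderr  # the error message to go read
```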
Go up higher in the log until you find lines that look like: [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows: And [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows: These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. ==Reproducing Problems== When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. In the log of your failing Toil task, look for lines like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk.
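A Toil file ID can be joined onto the job store path with ordinary path handling, and the %-encoded <code>toilfile:</code> URIs that appear in some log lines can be decoded with Python's <code>urllib</code> instead of an online tool. A sketch, where the job store path and the second (shortened) file ID are hypothetical:

```python
import os.path
from urllib.parse import unquote

# Hypothetical --jobStore location; substitute your own.
job_store = "/private/groups/examplelab/example-user/jobstore"

# File ID copied from a "Downloaded file ..." log line:
file_id = ("files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/"
           "file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam")

# The ID is a path relative to the job store:
on_disk = os.path.join(job_store, file_id)
assert on_disk.startswith(job_store)
assert on_disk.endswith("/Sample.bam")

# %-encoded paths from "Virtualized ... as WDL file toilfile:..." lines
# decode the same way (%3A is ':', %2F is '/'); this ID is shortened:
encoded = "files%2Ffor-job%2Fkind-WDLTaskJob%2Ffile-abc123%2FSample.bam"
assert unquote(encoded) == "files/for-job/kind-WDLTaskJob/file-abc123/Sample.bam"
```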
So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ==More Ways of Finding Files== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name. For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ==Using Development Versions of Toil== Sometimes, bugs will be fixed in the development version of Toil, but not released yet. 
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ==Frequently Asked Questions== ===I am getting warnings about <code>XDG_RUNTIME_DIR</code>=== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it. ===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!=== The Toil worker process for each job will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. 
=Additional WDL resources= For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL documentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification] 60ab14c68d40978f9c52e7864bd2b116166d3f1e Slurm Tips for Toil 0 38 470 441 2024-02-16T19:08:32Z Anovak 4 wikitext text/x-wiki Here are some tips for running Toil workflows on the Phoenix Slurm cluster. Mostly you might want to run WDL workflows, but you can use some of these for other workflows like Cactus. You can also consult [https://github.com/DataBiosphere/toil/blob/master/docs/wdl/running.rst the Toil documentation on WDL workflows]. * Install Toil with WDL support with: pip3 install --upgrade toil[wdl] To use a development version of Toil, you can install from source instead: pip3 install git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl] Or for a particular branch: pip3 install git+https://github.com/DataBiosphere/toil.git@issues/123-abc#egg=toil[wdl] * You will then need to make sure your '''~/.local/bin''' directory is on your PATH. Open up your '''~/.bashrc''' file and add: export PATH=$PATH:$HOME/.local/bin Then make sure to log out and back in again. * For Toil options, you will want '''--batchSystem slurm''' to make it use Slurm and '''--batchLogsDir ./logs''' (or some other location on a shared filesystem) so the Slurm logs don't get lost. * You may be able to speed up your workflow with '''--caching true''', to cache data on nodes to be shared among multiple simultaneous tasks. * If using '''toil-wdl-runner''', you might want to add '''--jobStore ./jobStore''' to make sure the job store is in a defined, shared location so that you can use '''--restart''' later.
* If using '''toil-wdl-runner''', you will want to set the '''SINGULARITY_CACHEDIR''' and '''MINIWDL__SINGULARITY__IMAGE_CACHE''' environment variables for your workflow to locations on shared storage, and possibly to the default cache locations in your home directory. Otherwise Toil will set them to node-local directories for each node, and thus re-download images for each workflow run, and for each cluster node. To avoid this, you could, for example, run the following before your workflow or add it to your '''~/.bashrc''': export SINGULARITY_CACHEDIR=$HOME/.singularity/cache export MINIWDL__SINGULARITY__IMAGE_CACHE=$HOME/.cache/miniwdl b43add8e50f9203265cde9883be4edcf9b77afa3 Cluster Etiquette 0 47 471 431 2024-03-06T17:24:37Z Anovak 4 Link to the storage visualization wikitext text/x-wiki When running jobs on the cluster, you must be very aware of how those jobs will affect other users. 1: Always test your job by running one first. Just one. Note how much RAM, how many CPU cores and how much time it takes to run. Then, when you submit 50 or 100 of those, you can specify limits in your Slurm batch file on how long the job should run, how much RAM it should use and how much time it takes. In that case, Slurm can stop jobs that inadvertently go too long or use too many resources. 2: Don't run too many jobs at once if they use a lot of disk I/O. If every job reads in a 100GB file, and you launch 20 of them at the same time, you could bring down the file server serving /private/groups. Run only maybe 5 at once in that case, or introduce a random delay at the start of your jobs. You can limit your concurrent jobs by specifying something like this in your job batch file: #SBATCH --array=[1-279]%10 inputList=$1 input=$(sed -n "$SLURM_ARRAY_TASK_ID"p $inputList) some_command $input 3: Don't use too much storage. Use http://logserv.gi.ucsc.edu/cgi-bin/private-groups.cgi to look at how your storage use is divided among your directories, and clean up large chunks of data that you do not need.
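The <code>sed -n "$SLURM_ARRAY_TASK_ID"p</code> line in the array-job example above picks one input line per array task. The same selection in Python terms, with a made-up input list:

```python
# Array task IDs are 1-based, like sed line numbers, so task N
# processes line N of the input list file.
input_list = ["sample_A.bam", "sample_B.bam", "sample_C.bam"]  # hypothetical

task_id = 2  # what $SLURM_ARRAY_TASK_ID would be for the second task
selected = input_list[task_id - 1]
assert selected == "sample_B.bam"
```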
8a3f92f2ab671c05374adc8dc82fbfac9d85c741 482 471 2024-05-03T03:24:15Z Weiler 3 wikitext text/x-wiki When running jobs on the cluster, you must be very aware of how those jobs will affect other users. 1: Always test your job by running one first. Just one. Note how much RAM, how many CPU cores and how much time it takes to run. Then, when you submit 50 or 100 of those, you can specify limits in your Slurm batch file on how long the job should run, how much RAM it should use and how much time it takes. In that case, slurm can stop jobs that inadvertently go too long or use too many resources. 2: Don't run too many jobs at once if they use a lot of disk I/O. If every job reads in a 100GB file, and you launch 20 of them at the same time, you could bring down the file server serving /private/groups. Run only maybe 5 at once in that case, or introduce a random delay at the start of your jobs. You can limit your concurrent jobs by specifying something like this in your job batch file: #SBATCH --array=[1-279]%10 inputList=$1 input=$(sed -n "$SLURM_ARRAY_TASK_ID"p $inputList) some_command $input 3: Don't use too much storage. Use http://logserv.gi.ucsc.edu/cgi-bin/private-groups.cgi to look at how your storage use is divided among your directories, and clean up large chunks of data that you do not need. ba2478796c2c7e3eba71f35ff27389503ee58b65 483 482 2024-05-03T03:25:13Z Weiler 3 wikitext text/x-wiki When running jobs on the cluster, you must be very aware of how those jobs will affect other users. 1: Always test your job by running one first. Just one. Note how much RAM, how many CPU cores and how much time it takes to run. Then, when you submit 50 or 100 of those, you can specify limits in your Slurm batch file on how long the job should run, how much RAM it should use and how much time it takes. In that case, slurm can stop jobs that inadvertently go too long or use too many resources. 2: Don't run too many jobs at once if they use a lot of disk I/O. 
If every job reads in a 100GB file, and you launch 20 of them at the same time, you could bring down the /private/groups filesystem. Run only maybe 5 at once in that case, or introduce a random delay at the start of your jobs. You can limit your concurrent jobs by specifying something like this in your job batch file: #SBATCH --array=[1-279]%10 inputList=$1 input=$(sed -n "$SLURM_ARRAY_TASK_ID"p $inputList) some_command $input 3: Don't use too much storage. Use http://logserv.gi.ucsc.edu/cgi-bin/private-groups.cgi to look at how your storage use is divided among your directories, and clean up large chunks of data that you do not need. 543b1e1f5990fe973473a67c430cb5c81be58e7f AWS Account List and Numbers 0 22 472 424 2024-03-19T18:07:32Z Weiler 3 wikitext text/x-wiki This is a list of our currently available AWS accounts and their account numbers: ucsc-bd2k : 862902209576 ucsc-toil-dev : 318423852362 ucsc-vg-dev : 781907127277 ucsc-platform-dev : 719818754276 comparative-genomics-dev : 162786355865 nanopore-dev : 270442831226 ucsc-cgp-production : 097093801910 platform-hca-dev : 122796619775 anvil-dev : 608666466534 gi-gateway : 652235167018 pangenomics : 422448306679 braingeneers : 443872533066 ucsctreehouse : 238605363322 ucsc-bisti-dev : 851631505710 ucsc-genome-browser : 784962239183 dockstore-dev : 635220370222 ucsc-spatial : 541180793903 platform-hca-prod : 542754589326 platform-hca-portal : 158963592881 miga-lab : 156518225147 platform-anvil-dev : 289950828509 platform-anvil-prod : 465330168186 platform-anvil-portal : 166384485414 agc-runs : 598929688444 sequencing-center-cold-store : 436140841220 hprc-training : 654654365441 aa43f5df3b8d3a96e4ddf56f11a8172efd2e6994 488 472 2024-05-10T16:16:49Z Weiler 3 wikitext text/x-wiki This is a list of our currently available AWS accounts and their account numbers: ucsc-bd2k : 862902209576 ucsc-toil-dev : 318423852362 ucsc-vg-dev : 781907127277 ucsc-platform-dev : 719818754276 comparative-genomics-dev : 162786355865 
nanopore-dev : 270442831226 ucsc-cgp-production : 097093801910 platform-hca-dev : 122796619775 anvil-dev : 608666466534 gi-gateway : 652235167018 pangenomics : 422448306679 braingeneers : 443872533066 ucsctreehouse : 238605363322 ucsc-bisti-dev : 851631505710 ucsc-genome-browser : 784962239183 dockstore-dev : 635220370222 ucsc-spatial : 541180793903 platform-hca-prod : 542754589326 platform-hca-portal : 158963592881 miga-lab : 156518225147 platform-anvil-dev : 289950828509 platform-anvil-prod : 465330168186 platform-anvil-portal : 166384485414 platform-temp-dev : 654654270592 agc-runs : 598929688444 sequencing-center-cold-store : 436140841220 hprc-training : 654654365441 fb98890fecca1981f140200ec27b12a4be3e9228 Firewalled Environment Storage Overview 0 39 475 393 2024-03-27T17:48:47Z Weiler 3 wikitext text/x-wiki == Storage == Our servers mount two types of ''shared'' storage; home directories and group storage directories. These home directories will mount over the network to all shared compute servers and the phoenix cluster, so any server you login to will have these filesystems available: '''Filesystem Specifications''' {| class="wikitable" style="text-align:center;" |- style="font-weight:bold; text-align:left;" ! Filesystem<br /> ! /private/home ! /private/groups |- | style="font-weight:bold; text-align:left;" | Default Soft Quota | 30 GB | 15 TB |- | style="font-weight:bold; text-align:left;" | Default Hard Quota | 31 GB | 16 TB |- | style="font-weight:bold; text-align:left;" | Total Capacity | 19 TB | 500 TB |- style="text-align:left;" | style="font-weight:bold;" | Access Speed | Very Fast (NVMe Flash Media) | Very Fast (NVMe Flash Media) |- style="text-align:left;" | style="font-weight:bold;" | Intended Use | This space should be used for login scripts, small bits of code or software repos, etc. No large data should be stored here. | This space should be used for large computational/shared data, large software installations and the like. 
|} '''Home Directories (/private/home/username)''' Your home directory will be located as "/private/home/username" and has a 30GB soft quota and a 31GB hard quota. Your home directory is meant for small scripts and login data, or a git repo. Please do not try to store large data there or run large jobs using data in your home directory. '''Groups Directories (/private/groups/groupname)''' The group storage directories are created per PI, and each group directory has a default 15TB hard quota. For example, if David Haussler is the PI that you report to directly, then the directory would exist as /private/groups/hausslerlab. Request access to that group directory and you will then be able to write to it. Each of those group directories is shared by the lab it belongs to, so you must be wary of everyone's data usage and share the 15TB available per group accordingly. On the compute servers you can check your group's current quota usage by using the 'getfattr' command. You can only check the quota of a group you are part of (you would be a member of the UNIX group of the same name). If you wanted to check the quota usage of /private/groups/hausslerlab for example, you would do: $ getfattr -n ceph.dir.rbytes /private/groups/hausslerlab getfattr: Removing leading '/' from absolute path names # file: private/groups/hausslerlab ceph.dir.rbytes="6522955553147" That number is in bytes. So divide by 1,000,000,000,000 and you get '6.522 TB'. That is how much data is currently being used. To check the max quota limit, use this command: $ getfattr -n ceph.quota.max_bytes /private/groups/hausslerlab getfattr: Removing leading '/' from absolute path names # file: private/groups/hausslerlab ceph.quota.max_bytes="15000000000000" And 15000000000000 divided by 1,000,000,000,000 is 15 TB. == /data/scratch Space on the Servers == Each server will generally have a local /data/scratch filesystem that you can use to store temporary files.
'''BE ADVISED''' that /data/scratch is not backed up, and the data there could disappear in the event of a disk failure or anything else. Do not store important data there. If it is important, it should be moved somewhere else very soon after creation. 0cc13680fb8380dc6e8398f5aa7cd4b309526b1c 476 475 2024-03-27T18:00:45Z Weiler 3 /* Storage */ wikitext text/x-wiki == Storage == Our servers mount two types of ''shared'' storage; home directories and group storage directories. These home directories will mount over the network to all shared compute servers and the phoenix cluster, so any server you login to will have these filesystems available: '''Filesystem Specifications''' {| class="wikitable" style="text-align:center;" |- style="font-weight:bold; text-align:left;" ! Filesystem<br /> ! /private/home ! /private/groups |- | style="font-weight:bold; text-align:left;" | Default Soft Quota | 30 GB | 15 TB |- | style="font-weight:bold; text-align:left;" | Default Hard Quota | 31 GB | 15 TB |- | style="font-weight:bold; text-align:left;" | Total Capacity | 19 TB | 500 TB |- style="text-align:left;" | style="font-weight:bold;" | Access Speed | Very Fast (NVMe Flash Media) | Very Fast (NVMe Flash Media) |- style="text-align:left;" | style="font-weight:bold;" | Intended Use | This space should be used for login scripts, small bits of code or software repos, etc. No large data should be stored here. | This space should be used for large computational/shared data, large software installations and the like. |} '''Home Directories (/private/home/username)''' Your home directory will be located as "/private/home/username" and has a 30GB soft quota and a 31GB hard quota. Your home directory is meant for small scripts and login data, or a git repo. Please do not try to store large data there or computer on large jobs using data in your home directory. 
'''Groups Directories (/private/groups/groupname)''' The group storage directories are created per PI, and each group directory has a default 15TB hard quota. For example, if David Haussler is the PI that you report to directly, then the directory would exist as /private/groups/hausslerlab. Request access to that group directory and you will then be able to write to it. Each of those group directories are shared by the lab it belongs to, so you must be wary of everyone's data usage and share the 15TB available per group accordingly. On the compute servers you can check your group's current quota usage by using the 'getfattr' command. You can only check the quota of a group you are part of (you would be a member of the UNIX group of the same name). If you wanted to check the quota usage of /private/groups/hausslerlab for example, you would do: $ getfattr -n ceph.dir.rbytes /private/groups/hausslerlab getfattr: Removing leading '/' from absolute path names # file: private/groups/hausslerlab ceph.dir.rbytes="6522955553147" That number is in bytes. So divide by 1,000,000,000,000 and you get '6.522 TB'. That is how much data is currently being used. To check the max quota limit, use this command: $ getfattr -n ceph.quota.max_bytes /private/groups/hausslerlab getfattr: Removing leading '/' from absolute path names # file: private/groups/hausslerlab ceph.quota.max_bytes="15000000000000" And 15000000000000 divided by 1,000,000,000,000 is 15 TB. == /data/scratch Space on the Servers == Each server will generally have a local /data/scratch filesystem that you can use to store temporary files. '''BE ADVISED''' that /data/scratch is not backed up, and the data there could disappear in the event of a disk failure or anything else. Do not store important data there. If it is important, it should be moved somewhere else very soon after creation. 
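The /data/scratch guidance above can be wrapped into a simple job-script pattern: stage temporary files on local scratch, and copy anything worth keeping off it as soon as it is produced. This is only a sketch; the fallback path and the example result directory are assumptions for illustration, not cluster policy.

```shell
# Sketch: keep temporaries on local scratch, then copy results somewhere safe.
# /data/scratch is the cluster path described above; the fallback lets the
# snippet run on machines without it.
set -eu

if [ -d /data/scratch ]; then
    SCRATCH_ROOT=/data/scratch       # local disk, NOT backed up
else
    SCRATCH_ROOT="${TMPDIR:-/tmp}"   # fallback for machines without it
fi
RESULT_DIR="${RESULT_DIR:-$PWD}"     # e.g. /private/groups/somelab (example)

# Per-job working directory, cleaned up automatically on exit
workdir=$(mktemp -d "${SCRATCH_ROOT}/scratchjob.XXXXXX")
trap 'rm -rf "$workdir"' EXIT

# ... run the real job here, pointing its temp files at $workdir ...
echo "job output" > "$workdir/result.txt"

# Copy anything worth keeping off scratch as soon as it is produced
cp "$workdir/result.txt" "$RESULT_DIR/"
```

The trap ensures the scratch directory is removed even if the job fails partway through, which also keeps /data/scratch from filling up with abandoned files.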
dbcf1a58eedb9940688ab2364a2928469855b5d7 477 476 2024-03-27T18:01:08Z Weiler 3 /* Storage */ wikitext text/x-wiki == Storage == Our servers mount two types of ''shared'' storage; home directories and group storage directories. These home directories will mount over the network to all shared compute servers and the phoenix cluster, so any server you login to will have these filesystems available: '''Filesystem Specifications''' {| class="wikitable" style="text-align:center;" |- style="font-weight:bold; text-align:left;" ! Filesystem<br /> ! /private/home ! /private/groups |- | style="font-weight:bold; text-align:left;" | Default Soft Quota | 30 GB | 15 TB |- | style="font-weight:bold; text-align:left;" | Default Hard Quota | 31 GB | 15 TB |- | style="font-weight:bold; text-align:left;" | Total Capacity | 19 TB | 800 TB |- style="text-align:left;" | style="font-weight:bold;" | Access Speed | Very Fast (NVMe Flash Media) | Very Fast (NVMe Flash Media) |- style="text-align:left;" | style="font-weight:bold;" | Intended Use | This space should be used for login scripts, small bits of code or software repos, etc. No large data should be stored here. | This space should be used for large computational/shared data, large software installations and the like. |} '''Home Directories (/private/home/username)''' Your home directory will be located as "/private/home/username" and has a 30GB soft quota and a 31GB hard quota. Your home directory is meant for small scripts and login data, or a git repo. Please do not try to store large data there or computer on large jobs using data in your home directory. '''Groups Directories (/private/groups/groupname)''' The group storage directories are created per PI, and each group directory has a default 15TB hard quota. For example, if David Haussler is the PI that you report to directly, then the directory would exist as /private/groups/hausslerlab. Request access to that group directory and you will then be able to write to it. 
Each of those group directories are shared by the lab it belongs to, so you must be wary of everyone's data usage and share the 15TB available per group accordingly. On the compute servers you can check your group's current quota usage by using the 'getfattr' command. You can only check the quota of a group you are part of (you would be a member of the UNIX group of the same name). If you wanted to check the quota usage of /private/groups/hausslerlab for example, you would do: $ getfattr -n ceph.dir.rbytes /private/groups/hausslerlab getfattr: Removing leading '/' from absolute path names # file: private/groups/hausslerlab ceph.dir.rbytes="6522955553147" That number is in bytes. So divide by 1,000,000,000,000 and you get '6.522 TB'. That is how much data is currently being used. To check the max quota limit, use this command: $ getfattr -n ceph.quota.max_bytes /private/groups/hausslerlab getfattr: Removing leading '/' from absolute path names # file: private/groups/hausslerlab ceph.quota.max_bytes="15000000000000" And 15000000000000 divided by 1,000,000,000,000 is 15 TB. == /data/scratch Space on the Servers == Each server will generally have a local /data/scratch filesystem that you can use to store temporary files. '''BE ADVISED''' that /data/scratch is not backed up, and the data there could disappear in the event of a disk failure or anything else. Do not store important data there. If it is important, it should be moved somewhere else very soon after creation. 9108e2840e61e506f31426ae708bf9c3a232de48 491 477 2024-05-13T16:30:25Z Weiler 3 wikitext text/x-wiki == Storage == Our servers mount two types of ''shared'' storage; home directories and group storage directories. 
These home directories will mount over the network to all shared compute servers and the phoenix cluster, so any server you login to will have these filesystems available: '''Filesystem Specifications''' {| class="wikitable" style="text-align:center;" |- style="font-weight:bold; text-align:left;" ! Filesystem<br /> ! /private/home ! /private/groups |- | style="font-weight:bold; text-align:left;" | Default Soft Quota | 30 GB | 15 TB |- | style="font-weight:bold; text-align:left;" | Default Hard Quota | 31 GB | 15 TB |- | style="font-weight:bold; text-align:left;" | Total Capacity | 19 TB | 800 TB |- style="text-align:left;" | style="font-weight:bold;" | Access Speed | Very Fast (NVMe Flash Media) | Very Fast (NVMe Flash Media) |- style="text-align:left;" | style="font-weight:bold;" | Intended Use | This space should be used for login scripts, small bits of code or software repos, etc. No large data should be stored here. | This space should be used for large computational/shared data, large software installations and the like. |} '''Home Directories (/private/home/username)''' Your home directory will be located as "/private/home/username" and has a 30GB soft quota and a 31GB hard quota. Your home directory is meant for small scripts and login data, or a git repo. Please do not try to store large data there or computer on large jobs using data in your home directory. '''Groups Directories (/private/groups/groupname)''' The group storage directories are created per PI, and each group directory has a default 15TB hard quota. For example, if David Haussler is the PI that you report to directly, then the directory would exist as /private/groups/hausslerlab. Request access to that group directory and you will then be able to write to it. Each of those group directories are shared by the lab it belongs to, so you must be wary of everyone's data usage and share the 15TB available per group accordingly. 
On the compute servers you can check your group's current quota usage by using the 'getfattr' command. You can only check the quota of a group you are part of (you would be a member of the UNIX group of the same name). If you wanted to check the quota usage of /private/groups/hausslerlab for example, you would do: $ getfattr -n ceph.dir.rbytes /private/groups/hausslerlab getfattr: Removing leading '/' from absolute path names # file: private/groups/hausslerlab ceph.dir.rbytes="6522955553147" That number is in bytes. So divide by 1,000,000,000,000 and you get '6.522 TB'. That is how much data is currently being used. To check the max quota limit, use this command: $ getfattr -n ceph.quota.max_bytes /private/groups/hausslerlab getfattr: Removing leading '/' from absolute path names # file: private/groups/hausslerlab ceph.quota.max_bytes="15000000000000" And 15000000000000 divided by 1,000,000,000,000 is 15 TB. == Storage Quota Alerting == If you and/or folks in your lab would like an automated alert when the /private/groups/labname quota is getting to a certain percentage of fullness, we can set that up for you and others in your lab. Just email cluster-admin@soe.ucsc.edu with the following information: 1: Which directory you would like to watch quotas on (i.e. /private/groups/somelab) 2: What % full you would like an email alert at 3: What email addresses you want on the alert list After setup, our alerting system will alert folks on that email list every 4 hours until the quota in question is reduced to an amount under the alerting % threshold you asked for. So it is a bit noisy, but will force folks to delete data in order to stop the alerts. When the system notices that the quota usage has decreased to under the alert threshold, you will receive one final email with an "OK" notification that things are OK now. == /data/scratch Space on the Servers == Each server will generally have a local /data/scratch filesystem that you can use to store temporary files. 
'''BE ADVISED''' that /data/scratch is not backed up, and the data there could disappear in the event of a disk failure or anything else. Do not store important data there. If it is important, it should be moved somewhere else very soon after creation. a623ff379efd6e0fa4f395832ae6a1d924a64a3b 492 491 2024-05-13T16:31:28Z Weiler 3 /* Storage Quota Alerting */ wikitext text/x-wiki == Storage == Our servers mount two types of ''shared'' storage; home directories and group storage directories. These home directories will mount over the network to all shared compute servers and the phoenix cluster, so any server you login to will have these filesystems available: '''Filesystem Specifications''' {| class="wikitable" style="text-align:center;" |- style="font-weight:bold; text-align:left;" ! Filesystem<br /> ! /private/home ! /private/groups |- | style="font-weight:bold; text-align:left;" | Default Soft Quota | 30 GB | 15 TB |- | style="font-weight:bold; text-align:left;" | Default Hard Quota | 31 GB | 15 TB |- | style="font-weight:bold; text-align:left;" | Total Capacity | 19 TB | 800 TB |- style="text-align:left;" | style="font-weight:bold;" | Access Speed | Very Fast (NVMe Flash Media) | Very Fast (NVMe Flash Media) |- style="text-align:left;" | style="font-weight:bold;" | Intended Use | This space should be used for login scripts, small bits of code or software repos, etc. No large data should be stored here. | This space should be used for large computational/shared data, large software installations and the like. |} '''Home Directories (/private/home/username)''' Your home directory will be located as "/private/home/username" and has a 30GB soft quota and a 31GB hard quota. Your home directory is meant for small scripts and login data, or a git repo. Please do not try to store large data there or computer on large jobs using data in your home directory. 
'''Groups Directories (/private/groups/groupname)''' The group storage directories are created per PI, and each group directory has a default 15TB hard quota. For example, if David Haussler is the PI that you report to directly, then the directory would exist as /private/groups/hausslerlab. Request access to that group directory and you will then be able to write to it. Each of those group directories are shared by the lab it belongs to, so you must be wary of everyone's data usage and share the 15TB available per group accordingly. On the compute servers you can check your group's current quota usage by using the 'getfattr' command. You can only check the quota of a group you are part of (you would be a member of the UNIX group of the same name). If you wanted to check the quota usage of /private/groups/hausslerlab for example, you would do: $ getfattr -n ceph.dir.rbytes /private/groups/hausslerlab getfattr: Removing leading '/' from absolute path names # file: private/groups/hausslerlab ceph.dir.rbytes="6522955553147" That number is in bytes. So divide by 1,000,000,000,000 and you get '6.522 TB'. That is how much data is currently being used. To check the max quota limit, use this command: $ getfattr -n ceph.quota.max_bytes /private/groups/hausslerlab getfattr: Removing leading '/' from absolute path names # file: private/groups/hausslerlab ceph.quota.max_bytes="15000000000000" And 15000000000000 divided by 1,000,000,000,000 is 15 TB. == Storage Quota Alerting == If you and/or folks in your lab would like an automated alert when the /private/groups/labname quota is getting to a certain percentage of fullness, we can set that up for you and others in your lab. Just email '''cluster-admin@soe.ucsc.edu''' with the following information: 1: Which directory you would like to watch quotas on (i.e. 
/private/groups/somelab) 2: What % full you would like an email alert at 3: What email addresses you want on the alert list After setup, our alerting system will alert folks on that email list ''every 4 hours'' until the quota in question is reduced to an amount under the alerting % threshold you asked for. So it is a bit noisy, but will force folks to delete data in order to stop the alerts. When the system notices that the quota usage has decreased to under the alert threshold, you will receive one final email with an "OK" notification that things are OK now. == /data/scratch Space on the Servers == Each server will generally have a local /data/scratch filesystem that you can use to store temporary files. '''BE ADVISED''' that /data/scratch is not backed up, and the data there could disappear in the event of a disk failure or anything else. Do not store important data there. If it is important, it should be moved somewhere else very soon after creation. 4c847065b805be2a55dbd1bcf9548a019b2ab5ed 493 492 2024-05-14T18:21:34Z Weiler 3 wikitext text/x-wiki == Storage == Our servers mount two types of ''shared'' storage; home directories and group storage directories. These home directories will mount over the network to all shared compute servers and the phoenix cluster, so any server you login to will have these filesystems available: '''Filesystem Specifications''' {| class="wikitable" style="text-align:center;" |- style="font-weight:bold; text-align:left;" ! Filesystem<br /> ! /private/home ! 
/private/groups |- | style="font-weight:bold; text-align:left;" | Default Soft Quota | 30 GB | 15 TB |- | style="font-weight:bold; text-align:left;" | Default Hard Quota | 31 GB | 15 TB |- | style="font-weight:bold; text-align:left;" | Total Capacity | 19 TB | 800 TB |- style="text-align:left;" | style="font-weight:bold;" | Access Speed | Very Fast (NVMe Flash Media) | Very Fast (NVMe Flash Media) |- style="text-align:left;" | style="font-weight:bold;" | Intended Use | This space should be used for login scripts, small bits of code or software repos, etc. No large data should be stored here. | This space should be used for large computational/shared data, large software installations and the like. |} '''Home Directories (/private/home/username)''' Your home directory will be located as "/private/home/username" and has a 30GB soft quota and a 31GB hard quota. Your home directory is meant for small scripts and login data, or a git repo. Please do not try to store large data there or computer on large jobs using data in your home directory. '''Groups Directories (/private/groups/groupname)''' The group storage directories are created per PI, and each group directory has a default 15TB hard quota. For example, if David Haussler is the PI that you report to directly, then the directory would exist as /private/groups/hausslerlab. Request access to that group directory and you will then be able to write to it. Each of those group directories are shared by the lab it belongs to, so you must be wary of everyone's data usage and share the 15TB available per group accordingly. On the compute servers you can check your group's current quota usage by using the 'getfattr' command. You can only check the quota of a group you are part of (you would be a member of the UNIX group of the same name). 
If you wanted to check the quota usage of /private/groups/hausslerlab for example, you would do: $ getfattr -n ceph.dir.rbytes /private/groups/hausslerlab getfattr: Removing leading '/' from absolute path names # file: private/groups/hausslerlab ceph.dir.rbytes="6522955553147" That number is in bytes. So divide by 1,000,000,000,000 and you get '6.522 TB'. That is how much data is currently being used. To check the max quota limit, use this command: $ getfattr -n ceph.quota.max_bytes /private/groups/hausslerlab getfattr: Removing leading '/' from absolute path names # file: private/groups/hausslerlab ceph.quota.max_bytes="15000000000000" And 15000000000000 divided by 1,000,000,000,000 is 15 TB. == Storage Quota Alerting == If you and/or folks in your lab would like an automated alert when the /private/groups/labname quota is getting to a certain percentage of fullness, we can set that up for you and others in your lab. Just email '''cluster-admin@soe.ucsc.edu''' with the following information: 1: Which directory you would like to watch quotas on (i.e. /private/groups/somelab) 2: What % full you would like an email alert at 3: What email addresses you want on the alert list After setup, our alerting system will alert folks on that email list ''every 4 hours'' until the quota in question is reduced to an amount under the alerting % threshold you asked for. So it is a bit noisy, but will force folks to delete data in order to stop the alerts. When the system notices that the quota usage has decreased to under the alert threshold, you will receive one final email with an "OK" notification that things are OK now. == /data/scratch Space on the Servers == Each server will generally have a local /data/scratch filesystem that you can use to store temporary files. '''BE ADVISED''' that /data/scratch is not backed up, and the data there could disappear in the event of a disk failure or anything else. Do not store important data there. 
If it is important, it should be moved somewhere else very soon after creation. == Backups == /private/groups is backed up weekly on Friday nights (which usually takes several days to complete). Please note that the following directories in the tree '''WILL NOT''' be backed up: tmp/ temp/ TMP/ TEMP/ cache/ .cache/ *.tmp/ So if you have data that you know isn't important and should be excluded from the backups, put it in a directory suffixed with ".tmp". Such as this example: /private/groups/clusteradmin/mybams.tmp/ 51a2bc970e10bbc14cf33117a1779b3bd8a0b0e8 494 493 2024-05-14T18:21:55Z Weiler 3 /* Backups */ wikitext text/x-wiki == Storage == Our servers mount two types of ''shared'' storage: home directories and group storage directories. These home directories will mount over the network to all shared compute servers and the phoenix cluster, so any server you log in to will have these filesystems available: '''Filesystem Specifications''' {| class="wikitable" style="text-align:center;" |- style="font-weight:bold; text-align:left;" ! Filesystem<br /> ! /private/home ! /private/groups |- | style="font-weight:bold; text-align:left;" | Default Soft Quota | 30 GB | 15 TB |- | style="font-weight:bold; text-align:left;" | Default Hard Quota | 31 GB | 15 TB |- | style="font-weight:bold; text-align:left;" | Total Capacity | 19 TB | 800 TB |- style="text-align:left;" | style="font-weight:bold;" | Access Speed | Very Fast (NVMe Flash Media) | Very Fast (NVMe Flash Media) |- style="text-align:left;" | style="font-weight:bold;" | Intended Use | This space should be used for login scripts, small bits of code or software repos, etc. No large data should be stored here. | This space should be used for large computational/shared data, large software installations and the like. |} '''Home Directories (/private/home/username)''' Your home directory will be located as "/private/home/username" and has a 30GB soft quota and a 31GB hard quota.
Your home directory is meant for small scripts and login data, or a git repo. Please do not store large data there or run large compute jobs against data in your home directory. '''Groups Directories (/private/groups/groupname)''' The group storage directories are created per PI, and each group directory has a default 15TB hard quota. For example, if David Haussler is the PI that you report to directly, then the directory would exist as /private/groups/hausslerlab. Request access to that group directory and you will then be able to write to it. Each group directory is shared by the lab it belongs to, so you must be mindful of everyone's data usage and share the 15TB available per group accordingly. On the compute servers you can check your group's current quota usage by using the 'getfattr' command. You can only check the quota of a group you are part of (you would be a member of the UNIX group of the same name). If you wanted to check the quota usage of /private/groups/hausslerlab for example, you would do: $ getfattr -n ceph.dir.rbytes /private/groups/hausslerlab getfattr: Removing leading '/' from absolute path names # file: private/groups/hausslerlab ceph.dir.rbytes="6522955553147" That number is in bytes. So divide by 1,000,000,000,000 and you get '6.522 TB'. That is how much data is currently being used. To check the max quota limit, use this command: $ getfattr -n ceph.quota.max_bytes /private/groups/hausslerlab getfattr: Removing leading '/' from absolute path names # file: private/groups/hausslerlab ceph.quota.max_bytes="15000000000000" And 15000000000000 divided by 1,000,000,000,000 is 15 TB. == Storage Quota Alerting == If you and/or folks in your lab would like an automated alert when the /private/groups/labname quota is getting to a certain percentage of fullness, we can set that up for you and others in your lab.
Just email '''cluster-admin@soe.ucsc.edu''' with the following information: 1: Which directory you would like to watch quotas on (i.e. /private/groups/somelab) 2: What % full you would like an email alert at 3: What email addresses you want on the alert list After setup, our alerting system will alert folks on that email list ''every 4 hours'' until the quota in question is reduced to an amount under the alerting % threshold you asked for. So it is a bit noisy, but will force folks to delete data in order to stop the alerts. When the system notices that the quota usage has decreased to under the alert threshold, you will receive one final email with an "OK" notification that things are OK now. == /data/scratch Space on the Servers == Each server will generally have a local /data/scratch filesystem that you can use to store temporary files. '''BE ADVISED''' that /data/scratch is not backed up, and the data there could disappear in the event of a disk failure or anything else. Do not store important data there. If it is important, it should be moved somewhere else very soon after creation. == Backups == /private/groups is backed up weekly on Friday nights (which usually takes several days to complete). Please note that the following directories in the tree '''WILL NOT''' be backed up: tmp/ temp/ TMP/ TEMP/ cache/ .cache/ *.tmp/ So if you have data that you know isn't important and should be excluded from the backups, put them in a directory suffixed with ".tmp". Such as this example: /private/groups/clusteradmin/mybams.tmp/ c93ec8f5cd28d6fb596538f5456188661e6108ae 495 494 2024-05-14T18:26:05Z Weiler 3 /* Backups */ wikitext text/x-wiki == Storage == Our servers mount two types of ''shared'' storage; home directories and group storage directories. 
These home directories will mount over the network to all shared compute servers and the phoenix cluster, so any server you log in to will have these filesystems available: '''Filesystem Specifications''' {| class="wikitable" style="text-align:center;" |- style="font-weight:bold; text-align:left;" ! Filesystem<br /> ! /private/home ! /private/groups |- | style="font-weight:bold; text-align:left;" | Default Soft Quota | 30 GB | 15 TB |- | style="font-weight:bold; text-align:left;" | Default Hard Quota | 31 GB | 15 TB |- | style="font-weight:bold; text-align:left;" | Total Capacity | 19 TB | 800 TB |- style="text-align:left;" | style="font-weight:bold;" | Access Speed | Very Fast (NVMe Flash Media) | Very Fast (NVMe Flash Media) |- style="text-align:left;" | style="font-weight:bold;" | Intended Use | This space should be used for login scripts, small bits of code or software repos, etc. No large data should be stored here. | This space should be used for large computational/shared data, large software installations and the like. |} '''Home Directories (/private/home/username)''' Your home directory will be located as "/private/home/username" and has a 30GB soft quota and a 31GB hard quota. Your home directory is meant for small scripts and login data, or a git repo. Please do not store large data there or run large compute jobs against data in your home directory. '''Groups Directories (/private/groups/groupname)''' The group storage directories are created per PI, and each group directory has a default 15TB hard quota. For example, if David Haussler is the PI that you report to directly, then the directory would exist as /private/groups/hausslerlab. Request access to that group directory and you will then be able to write to it. Each group directory is shared by the lab it belongs to, so you must be mindful of everyone's data usage and share the 15TB available per group accordingly.
On the compute servers you can check your group's current quota usage by using the 'getfattr' command. You can only check the quota of a group you are part of (you would be a member of the UNIX group of the same name). If you wanted to check the quota usage of /private/groups/hausslerlab for example, you would do: $ getfattr -n ceph.dir.rbytes /private/groups/hausslerlab getfattr: Removing leading '/' from absolute path names # file: private/groups/hausslerlab ceph.dir.rbytes="6522955553147" That number is in bytes. So divide by 1,000,000,000,000 and you get '6.522 TB'. That is how much data is currently being used. To check the max quota limit, use this command: $ getfattr -n ceph.quota.max_bytes /private/groups/hausslerlab getfattr: Removing leading '/' from absolute path names # file: private/groups/hausslerlab ceph.quota.max_bytes="15000000000000" And 15000000000000 divided by 1,000,000,000,000 is 15 TB. == Storage Quota Alerting == If you and/or folks in your lab would like an automated alert when the /private/groups/labname quota is getting to a certain percentage of fullness, we can set that up for you and others in your lab. Just email '''cluster-admin@soe.ucsc.edu''' with the following information: 1: Which directory you would like to watch quotas on (i.e. /private/groups/somelab) 2: What % full you would like an email alert at 3: What email addresses you want on the alert list After setup, our alerting system will alert folks on that email list ''every 4 hours'' until the quota in question is reduced to an amount under the alerting % threshold you asked for. So it is a bit noisy, but will force folks to delete data in order to stop the alerts. When the system notices that the quota usage has decreased to under the alert threshold, you will receive one final email with an "OK" notification that things are OK now. 
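The two getfattr values above are raw byte counts, so the TB conversion is easy to get wrong by a zero. A tiny helper function makes it mechanical; `tb_of` is a local convenience (an assumption, not a cluster-provided tool), and the getfattr calls shown in the comment only work on the Ceph-backed group directories.

```shell
# Helper: convert the raw byte counts reported by getfattr into decimal TB.
tb_of() {
    # decimal terabytes, three digits, e.g. 6522955553147 -> 6.523
    awk -v b="$1" 'BEGIN { printf "%.3f\n", b / 1e12 }'
}

# On a compute server you would feed it the ceph attributes, e.g.:
#   used=$(getfattr --only-values -n ceph.dir.rbytes /private/groups/hausslerlab)
#   max=$(getfattr --only-values -n ceph.quota.max_bytes /private/groups/hausslerlab)
#   echo "$(tb_of "$used") TB used of $(tb_of "$max") TB quota"

tb_of 6522955553147    # -> 6.523
tb_of 15000000000000   # -> 15.000
```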
== /data/scratch Space on the Servers == Each server will generally have a local /data/scratch filesystem that you can use to store temporary files. '''BE ADVISED''' that /data/scratch is not backed up, and the data there could disappear in the event of a disk failure or anything else. Do not store important data there. If it is important, it should be moved somewhere else very soon after creation. == Backups == /private/groups is backed up weekly on Friday nights (which usually takes several days to complete). Please note that the following directories in the tree '''WILL NOT''' be backed up: tmp/ temp/ TMP/ TEMP/ cache/ .cache/ scratch/ *.tmp/ So if you have data that you know isn't important and should be excluded from the backups, put it in a directory suffixed with ".tmp". Such as this example: /private/groups/clusteradmin/mybams.tmp/ 76042b97e90c0bab08f9475345abb1c991e0d82b Visual Studio Code (vscode) Configuration Tweaks 0 53 479 2024-04-26T18:11:45Z Weiler 3 Created page with "Visual Studio Code (vscode) is a popular IDE used for writing code, and many people use the remote functionality feature to edit code on a remote server, which is very cool. But the issue with vscode is that it frequently opens way to many files on the remote server in an attempt to cache search databases and code modifications, and that unnecessarily puts a large burden on the remote server kernel for caching filehandles an such, slow down the remote filesystem and se..." wikitext text/x-wiki Visual Studio Code (vscode) is a popular IDE used for writing code, and many people use the remote functionality feature to edit code on a remote server, which is very cool. But the issue with vscode is that it frequently opens way too many files on the remote server in an attempt to cache search databases and code modifications, which unnecessarily puts a large burden on the remote server kernel for caching filehandles and such, slowing down the remote filesystem and the server you are working on.
The fix seems to be to edit this file on the remote server: ~/.vscode-server/data/Machine/settings.json Create that file if it does not already exist. Then put the following text in it: { "search.exclude": { "**/node_modules": true, "**/bower_components": true, "**/env": true, "**/venv": true }, "files.watcherExclude": { "**/.git/objects/**": true, "**/.git/subtree-cache/**": true, "**/node_modules/*/**": true, "**/.cache/**": true, "**/.conda/**": true, "**/.local/**": true, "**/.nextflow/**": true, "**/env/**": true, "**/venv/**": true, "**/work/**": true } } Then close vscode. The most reliable way is to click on the bottom left where it usually says "SSH: mustard.prism" (or whatever server you are connected to). That populates the dropdown from the search bar, and at the bottom there is an action: "Close Remote Connection". Otherwise it sometimes reconnects or stays connected in the background. Then save any open files you have in vscode and close vscode on your laptop or workstation, then re-open it. 01eb56b337e0a67d8ca5e3ec11d8c8fc9c62143c 480 479 2024-04-26T18:12:06Z Weiler 3 wikitext text/x-wiki Visual Studio Code (vscode) is a popular IDE used for writing code, and many people use the remote functionality feature to edit code on a remote server, which is very cool. But the issue with vscode is that it frequently opens way too many files on the remote server in an attempt to cache search databases and code modifications, which unnecessarily puts a large burden on the remote server kernel for caching filehandles and such, slowing down the remote filesystem and the server you are working on. The fix seems to be to edit this file on the remote server: ~/.vscode-server/data/Machine/settings.json Create that file if it does not already exist.
Then put the following text in it: { "search.exclude": { "**/node_modules": true, "**/bower_components": true, "**/env": true, "**/venv": true }, "files.watcherExclude": { "**/.git/objects/**": true, "**/.git/subtree-cache/**": true, "**/node_modules/*/**": true, "**/.cache/**": true, "**/.conda/**": true, "**/.local/**": true, "**/.nextflow/**": true, "**/env/**": true, "**/venv/**": true, "**/work/**": true } } Then close vscode. The most reliable is to click on the bottom left where it usually says "SSH: mustard.prism" (or whatever server you are connected to). That populates the dropdown from the search bar, and at the bottom there is an action: "Close Remote Connection". Otherwise it seems to reconnect, stay connected in the background sometimes. Then save any open files you have in vscode and close vscode on your laptop or workstation, then re-open it. 9822bbccbe35f9741d55a895ae3ef3473d3ffa64 481 480 2024-04-27T15:09:03Z Weiler 3 wikitext text/x-wiki Visual Studio Code (vscode) is a popular IDE used for writing code, and many people use the remote functionality feature to edit code on a remote server, which is very cool. But the issue with vscode is that it frequently opens way to many files on the remote server in an attempt to cache search databases and code modifications, and that unnecessarily puts a large burden on the remote server kernel for caching filehandles an such, slow down the remote filesystem and server you are working on. The fix seems to be to edit this file on the remote server: ~/.vscode-server/data/Machine/settings.json Create that file if it does not already exist. 
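On a fresh server the directory may not exist yet; one way to create the file at the path described above is:

```shell
# Create the machine-level settings file for the vscode remote server
# if it is not already there.
mkdir -p ~/.vscode-server/data/Machine
touch ~/.vscode-server/data/Machine/settings.json
```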
Then put the following text in it:

<pre>
{
  "search.exclude": {
    "**/node_modules": true,
    "**/bower_components": true,
    "**/env": true,
    "**/venv": true
  },
  "files.watcherExclude": {
    "**/.git/objects/**": true,
    "**/.git/subtree-cache/**": true,
    "**/node_modules/*/**": true,
    "**/.cache/**": true,
    "**/.conda/**": true,
    "**/.local/**": true,
    "**/.nextflow/**": true,
    "**/env/**": true,
    "**/venv/**": true,
    "**/work/**": true,
    "**/private/groups/**": true
  }
}
</pre>

Then close vscode. The most reliable way is to click the bottom-left corner where it usually says "SSH: mustard.prism" (or whatever server you are connected to). That opens the dropdown under the search bar, and at the bottom there is an action: "Close Remote Connection". Otherwise vscode sometimes reconnects or stays connected in the background. Then save any open files you have in vscode, close vscode on your laptop or workstation, and re-open it.

== Firewalled Computing Resources Overview ==

== Doing Work and Computing ==

When doing research, running jobs, and the like, please be mindful of your resource consumption on the server you are on. Don't run so many threads or processes at once that you overrun the available RAM or disk IO. If you are not sure of your potential RAM, CPU, or disk impact, start small with one or two processes and work your way up from there.

Also, before running your jobs, check what else is already happening on the server by using the 'top' command to see who and what is running and what resources are already being consumed. If, after starting a process, you realize that the server slows down considerably or becomes unusable, kill your processes and re-evaluate what you need to make things work. These servers are shared resources - be a good neighbor!
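In practice, the pre-flight checks described above might look like this (standard procps/coreutils commands; the scratch path is the local /data/scratch area):

```shell
# Snapshot what the server is already doing before you start jobs.
top -b -n 1 | head -n 20                 # current processes and load average
free -h                                  # memory in use vs. available
nproc                                    # CPU cores on this server
df -h /data/scratch 2>/dev/null || true  # local scratch space remaining
```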
== Server Types and Management ==

After confirming your VPN software is working, you can ssh into one of the shared compute servers behind the VPN. The DNS suffix for all machines is ".prism", so "mustard" has the full DNS name "mustard.prism":

{| class="wikitable" style="text-align:center;"
|- style="font-weight:bold; text-align:left;"
! Node Name
! Operating System
! CPU Cores
! Memory
! Network Bandwidth
! Scratch Space
|-
| style="text-align:left;" | mustard
| style="text-align:left;" | Ubuntu 22.04
| 160
| 1.5 TB
| 10 Gb/s
| 9 TB
|-
| style="text-align:left;" | emerald
| style="text-align:left;" | Ubuntu 22.04
| 64
| 1 TB
| 10 Gb/s
| 690 GB
|-
| style="text-align:left;" | crimson
| style="text-align:left;" | Ubuntu 22.04
| 32
| 256 GB
| 10 Gb/s
| 5.5 TB
|-
| style="text-align:left;" | razzmatazz
| style="text-align:left;" | Ubuntu 22.04
| 32
| 256 GB
| 10 Gb/s
| 5.5 TB
|}

These ''shared'' servers are managed by the Genomics Institute Cluster Admin group. If you need software installed on any of these servers, please make your request by emailing cluster-admin@soe.ucsc.edu.

== The Firewall ==

All servers in this environment are behind a firewall, so you must connect to the VPN in order to access them; they are not reachable from the greater Internet without VPN. Outbound connections from the servers - to copy data in, sync git repos, and the like - still work; only inbound connections are blocked. All machines behind the firewall have the private domain-name suffix "*.prism".

== The Phoenix Cluster ==

This is a cluster of 25 Ubuntu 22.04 nodes, some of which have GPUs. Each node generally has about 2 TB of RAM and 256 cores, although the cluster is heterogeneous and has multiple node types. You interact with the Phoenix Cluster via the Slurm Job Scheduler.
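A minimal Slurm workflow sketch (the resource values are illustrative placeholders; run `sinfo` on a submit host to see the real partitions):

```shell
# Write a small batch script; on cluster nodes $TMPDIR points at the
# node-local /data/tmp scratch area.
cat > hello.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --time=00:05:00
echo "running on $(hostname), scratch in $TMPDIR"
EOF
# On an interactive server (mustard, emerald, crimson, razzmatazz):
#   sbatch hello.sbatch    # submit the job
#   squeue -u "$USER"      # watch it in the queue
```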
You must specifically request access to use Slurm on the Phoenix Cluster, just email '''cluster-admin@soe.ucsc.edu''' for access. {| class="wikitable" style="text-align:center;" |- style="font-weight:bold;" ! Node Name ! Operating System<br /> ! CPU Cores ! GPUs/Type ! Memory ! Network Bandwidth ! Scratch Space |- | style="text-align:left;" | phoenix-00 | style="text-align:left;" | Ubuntu 22.04 | 256 | 8 / Nvidia A100 | 1 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[01-05] | style="text-align:left;" | Ubuntu 22.04 | 128 | 8 / Nvidia A5500 | 2 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[06-08] | style="text-align:left;" | Ubuntu 22.04 | 128 | N/A | 2 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[11-21] | style="text-align:left;" | Ubuntu 22.04 | 128 | N/A | 2 TB | 10 Gb/s | 16 TB NVMe |} The cluster head node is '''phoenix.prism'''. Although you do not need to be logged into phoenix.prism to submit Slurm jobs, jobs can be submitted from any interactive compute server (mustard, emerald, razzmatazz or crimson). To learn more about how to use Slurm, refer to: https://giwiki.gi.ucsc.edu/index.php/Genomics_Institute_Computing_Information#Slurm_at_the_Genomics_Institute For scratch on the cluster, TMPDIR will be set to /data/tmp. That area is cleaned often so don't store any data there that isn't being used by your jobs. 61a85b3b432a3ef7364b05bbd1c4f86594c5cc41 490 484 2024-05-11T15:08:21Z Weiler 3 wikitext text/x-wiki == Doing Work and Computing == When doing research, running jobs and the like, please be careful of your resource consumption on the server you are on. Don't run too many threads or cores at once if such a thing overruns the RAM available or the disk IO available. If you are not sure of your potential RAM, CPU or disk impact, start small with one or two processes and work your way up from there. 
Also, before running your stuff, check what else is already happening on the server by using the 'top' command to see who else and what else is running and what kind of resources are already being consumed. If, after starting a process, you realize that the server slows down considerably or becomes unusable, kill your processes and re-evaluate what you need to make things work. These servers are shared resources - be a good neighbor! == Server Types and Management == After confirming your VPN software is working, you can ssh into one of the shared compute servers behind the VPN. The DNS suffix for all machines is ".prism". So, "mustard" would have a full DNS name of "mustard.prism": {| class="wikitable" style="text-align:center;" |- style="font-weight:bold; text-align:left;" ! Node Name ! Operating System<br /> ! CPU Cores ! Memory ! Network Bandwidth ! Scratch Space |- | style="text-align:left;" | mustard | style="text-align:left;" | Ubuntu 22.04 | 160 | 1.5 TB | 10 Gb/s | 9 TB |- | style="text-align:left;" | emerald | style="text-align:left;" | Ubuntu 22.04 | 64 | 1 TB | 10 Gb/s | 690 GB |- | style="text-align:left;" | crimson | style="text-align:left;" | Ubuntu 22.04 | 32 | 256 GB | 10 Gb/s | 5.5 TB |- | style="text-align:left;" | razzmatazz | style="text-align:left;" | Ubuntu 22.04 | 32 | 256 GB | 10 Gb/s | 5.5 TB |} These ''shared'' servers are managed by the Genomics Institute Cluster Admin group. If you need software installed on any of these servers, please make your request by emailing cluster-admin@soe.ucsc.edu. == The Firewall == All servers are behind a firewall in this environment, and as such, you must connect to the VPN in order to access them. They will not be accessible from the greater Internet without VPN. Although you will be able to connect outbound from them to other servers on the internet to copy data in, sync git repos, stuff like that. It is only inbound connections that will be blocked. 
All machines behind the firewall have the private domain name suffix of "*.prism". == The Phoenix Cluster == This is a cluster of ~22 Ubuntu 22.04 nodes, some of which have GPUs in them. Each node generally has about 2TB RAM and 128 cores, although the cluster is heterogeneous and has multiple node types. You interact with the Phoenix Cluster via the Slurm Job Scheduler. You must specifically request access to use Slurm on the Phoenix Cluster, just email '''cluster-admin@soe.ucsc.edu''' for access. {| class="wikitable" style="text-align:center;" |- style="font-weight:bold;" ! Node Name ! Operating System<br /> ! CPU Cores ! GPUs/Type ! Memory ! Network Bandwidth ! Scratch Space |- | style="text-align:left;" | phoenix-00 | style="text-align:left;" | Ubuntu 22.04 | 256 | 8 / Nvidia A100 | 1 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[01-05] | style="text-align:left;" | Ubuntu 22.04 | 128 | 8 / Nvidia A5500 | 2 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[06-08] | style="text-align:left;" | Ubuntu 22.04 | 128 | N/A | 2 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[11-21] | style="text-align:left;" | Ubuntu 22.04 | 128 | N/A | 2 TB | 10 Gb/s | 16 TB NVMe |} The cluster head node is '''phoenix.prism'''. However, you cannot directly login to phoenix.prism in order to protect the scheduler from errant or runaway jobs there, so jobs can be submitted from any interactive compute server (mustard, emerald, razzmatazz or crimson). To learn more about how to use Slurm, refer to: https://giwiki.gi.ucsc.edu/index.php/Genomics_Institute_Computing_Information#Slurm_at_the_Genomics_Institute For scratch on the cluster, TMPDIR will be set to /data/tmp. That area is cleaned often so don't store any data there that isn't being used by your jobs. 
2f0c5689b2fc019cf4203e3a541e983118a6dff7 498 490 2024-06-10T18:21:06Z Weiler 3 /* The Phoenix Cluster */ wikitext text/x-wiki == Doing Work and Computing == When doing research, running jobs and the like, please be careful of your resource consumption on the server you are on. Don't run too many threads or cores at once if such a thing overruns the RAM available or the disk IO available. If you are not sure of your potential RAM, CPU or disk impact, start small with one or two processes and work your way up from there. Also, before running your stuff, check what else is already happening on the server by using the 'top' command to see who else and what else is running and what kind of resources are already being consumed. If, after starting a process, you realize that the server slows down considerably or becomes unusable, kill your processes and re-evaluate what you need to make things work. These servers are shared resources - be a good neighbor! == Server Types and Management == After confirming your VPN software is working, you can ssh into one of the shared compute servers behind the VPN. The DNS suffix for all machines is ".prism". So, "mustard" would have a full DNS name of "mustard.prism": {| class="wikitable" style="text-align:center;" |- style="font-weight:bold; text-align:left;" ! Node Name ! Operating System<br /> ! CPU Cores ! Memory ! Network Bandwidth ! Scratch Space |- | style="text-align:left;" | mustard | style="text-align:left;" | Ubuntu 22.04 | 160 | 1.5 TB | 10 Gb/s | 9 TB |- | style="text-align:left;" | emerald | style="text-align:left;" | Ubuntu 22.04 | 64 | 1 TB | 10 Gb/s | 690 GB |- | style="text-align:left;" | crimson | style="text-align:left;" | Ubuntu 22.04 | 32 | 256 GB | 10 Gb/s | 5.5 TB |- | style="text-align:left;" | razzmatazz | style="text-align:left;" | Ubuntu 22.04 | 32 | 256 GB | 10 Gb/s | 5.5 TB |} These ''shared'' servers are managed by the Genomics Institute Cluster Admin group. 
If you need software installed on any of these servers, please make your request by emailing cluster-admin@soe.ucsc.edu. == The Firewall == All servers are behind a firewall in this environment, and as such, you must connect to the VPN in order to access them. They will not be accessible from the greater Internet without VPN. Although you will be able to connect outbound from them to other servers on the internet to copy data in, sync git repos, stuff like that. It is only inbound connections that will be blocked. All machines behind the firewall have the private domain name suffix of "*.prism". == The Phoenix Cluster == This is a cluster of ~22 Ubuntu 22.04 nodes, some of which have GPUs in them. Each node generally has about 2TB RAM and 128 cores, although the cluster is heterogeneous and has multiple node types. You interact with the Phoenix Cluster via the Slurm Job Scheduler. You must specifically request access to use Slurm on the Phoenix Cluster, just email '''cluster-admin@soe.ucsc.edu''' for access. {| class="wikitable" style="text-align:center;" |- style="font-weight:bold;" ! Node Name ! Operating System<br /> ! CPU Cores ! GPUs/Type ! Memory ! Network Bandwidth ! 
Scratch Space |- | style="text-align:left;" | phoenix-00 | style="text-align:left;" | Ubuntu 22.04 | 256 | 8 / Nvidia A100 | 1 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[01-05] | style="text-align:left;" | Ubuntu 22.04 | 128 | 8 / Nvidia A5500 | 2 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[06-08] | style="text-align:left;" | Ubuntu 22.04 | 256 | N/A | 2 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[09-10] | style="text-align:left;" | Ubuntu 22.04 | 384 | N/A | 2.3 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[11-21] | style="text-align:left;" | Ubuntu 22.04 | 256 | N/A | 2 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[22-24] | style="text-align:left;" | Ubuntu 22.04 | 384 | N/A | 2.3 TB | 10 Gb/s | 16 TB NVMe |} The cluster head node is '''phoenix.prism'''. However, you cannot directly login to phoenix.prism in order to protect the scheduler from errant or runaway jobs there, so jobs can be submitted from any interactive compute server (mustard, emerald, razzmatazz or crimson). To learn more about how to use Slurm, refer to: https://giwiki.gi.ucsc.edu/index.php/Genomics_Institute_Computing_Information#Slurm_at_the_Genomics_Institute For scratch on the cluster, TMPDIR will be set to /data/tmp. That area is cleaned often so don't store any data there that isn't being used by your jobs. c904979eb1d7363210d6ec9cf8b6ba601ae5a6a0 499 498 2024-06-10T18:21:25Z Weiler 3 /* The Phoenix Cluster */ wikitext text/x-wiki == Doing Work and Computing == When doing research, running jobs and the like, please be careful of your resource consumption on the server you are on. Don't run too many threads or cores at once if such a thing overruns the RAM available or the disk IO available. If you are not sure of your potential RAM, CPU or disk impact, start small with one or two processes and work your way up from there. 
Also, before running your stuff, check what else is already happening on the server by using the 'top' command to see who else and what else is running and what kind of resources are already being consumed. If, after starting a process, you realize that the server slows down considerably or becomes unusable, kill your processes and re-evaluate what you need to make things work. These servers are shared resources - be a good neighbor! == Server Types and Management == After confirming your VPN software is working, you can ssh into one of the shared compute servers behind the VPN. The DNS suffix for all machines is ".prism". So, "mustard" would have a full DNS name of "mustard.prism": {| class="wikitable" style="text-align:center;" |- style="font-weight:bold; text-align:left;" ! Node Name ! Operating System<br /> ! CPU Cores ! Memory ! Network Bandwidth ! Scratch Space |- | style="text-align:left;" | mustard | style="text-align:left;" | Ubuntu 22.04 | 160 | 1.5 TB | 10 Gb/s | 9 TB |- | style="text-align:left;" | emerald | style="text-align:left;" | Ubuntu 22.04 | 64 | 1 TB | 10 Gb/s | 690 GB |- | style="text-align:left;" | crimson | style="text-align:left;" | Ubuntu 22.04 | 32 | 256 GB | 10 Gb/s | 5.5 TB |- | style="text-align:left;" | razzmatazz | style="text-align:left;" | Ubuntu 22.04 | 32 | 256 GB | 10 Gb/s | 5.5 TB |} These ''shared'' servers are managed by the Genomics Institute Cluster Admin group. If you need software installed on any of these servers, please make your request by emailing cluster-admin@soe.ucsc.edu. == The Firewall == All servers are behind a firewall in this environment, and as such, you must connect to the VPN in order to access them. They will not be accessible from the greater Internet without VPN. Although you will be able to connect outbound from them to other servers on the internet to copy data in, sync git repos, stuff like that. It is only inbound connections that will be blocked. 
All machines behind the firewall have the private domain name suffix of "*.prism". == The Phoenix Cluster == This is a cluster of ~22 Ubuntu 22.04 nodes, some of which have GPUs in them. Each node generally has about 2TB RAM and 128 cores, although the cluster is heterogeneous and has multiple node types. You interact with the Phoenix Cluster via the Slurm Job Scheduler. You must specifically request access to use Slurm on the Phoenix Cluster, just email '''cluster-admin@soe.ucsc.edu''' for access. {| class="wikitable" style="text-align:center;" |- style="font-weight:bold;" ! Node Name ! Operating System<br /> ! CPU Cores ! GPUs/Type ! Memory ! Network Bandwidth ! Scratch Space |- | style="text-align:left;" | phoenix-00 | style="text-align:left;" | Ubuntu 22.04 | 256 | 8 / Nvidia A100 | 1 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[01-05] | style="text-align:left;" | Ubuntu 22.04 | 256 | 8 / Nvidia A5500 | 2 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[06-08] | style="text-align:left;" | Ubuntu 22.04 | 256 | N/A | 2 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[09-10] | style="text-align:left;" | Ubuntu 22.04 | 384 | N/A | 2.3 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[11-21] | style="text-align:left;" | Ubuntu 22.04 | 256 | N/A | 2 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[22-24] | style="text-align:left;" | Ubuntu 22.04 | 384 | N/A | 2.3 TB | 10 Gb/s | 16 TB NVMe |} The cluster head node is '''phoenix.prism'''. However, you cannot directly login to phoenix.prism in order to protect the scheduler from errant or runaway jobs there, so jobs can be submitted from any interactive compute server (mustard, emerald, razzmatazz or crimson). To learn more about how to use Slurm, refer to: https://giwiki.gi.ucsc.edu/index.php/Genomics_Institute_Computing_Information#Slurm_at_the_Genomics_Institute For scratch on the cluster, TMPDIR will be set to /data/tmp. 
That area is cleaned often so don't store any data there that isn't being used by your jobs. a55b470229379015ba5c96a8b000078a65729060 500 499 2024-06-10T18:22:00Z Weiler 3 /* The Phoenix Cluster */ wikitext text/x-wiki == Doing Work and Computing == When doing research, running jobs and the like, please be careful of your resource consumption on the server you are on. Don't run too many threads or cores at once if such a thing overruns the RAM available or the disk IO available. If you are not sure of your potential RAM, CPU or disk impact, start small with one or two processes and work your way up from there. Also, before running your stuff, check what else is already happening on the server by using the 'top' command to see who else and what else is running and what kind of resources are already being consumed. If, after starting a process, you realize that the server slows down considerably or becomes unusable, kill your processes and re-evaluate what you need to make things work. These servers are shared resources - be a good neighbor! == Server Types and Management == After confirming your VPN software is working, you can ssh into one of the shared compute servers behind the VPN. The DNS suffix for all machines is ".prism". So, "mustard" would have a full DNS name of "mustard.prism": {| class="wikitable" style="text-align:center;" |- style="font-weight:bold; text-align:left;" ! Node Name ! Operating System<br /> ! CPU Cores ! Memory ! Network Bandwidth ! 
Scratch Space |- | style="text-align:left;" | mustard | style="text-align:left;" | Ubuntu 22.04 | 160 | 1.5 TB | 10 Gb/s | 9 TB |- | style="text-align:left;" | emerald | style="text-align:left;" | Ubuntu 22.04 | 64 | 1 TB | 10 Gb/s | 690 GB |- | style="text-align:left;" | crimson | style="text-align:left;" | Ubuntu 22.04 | 32 | 256 GB | 10 Gb/s | 5.5 TB |- | style="text-align:left;" | razzmatazz | style="text-align:left;" | Ubuntu 22.04 | 32 | 256 GB | 10 Gb/s | 5.5 TB |} These ''shared'' servers are managed by the Genomics Institute Cluster Admin group. If you need software installed on any of these servers, please make your request by emailing cluster-admin@soe.ucsc.edu. == The Firewall == All servers are behind a firewall in this environment, and as such, you must connect to the VPN in order to access them. They will not be accessible from the greater Internet without VPN. Although you will be able to connect outbound from them to other servers on the internet to copy data in, sync git repos, stuff like that. It is only inbound connections that will be blocked. All machines behind the firewall have the private domain name suffix of "*.prism". == The Phoenix Cluster == This is a cluster of ~22 Ubuntu 22.04 nodes, some of which have GPUs in them. Each node generally has about 2TB RAM and 256 cores, although the cluster is heterogeneous and has multiple node types. You interact with the Phoenix Cluster via the Slurm Job Scheduler. You must specifically request access to use Slurm on the Phoenix Cluster, just email '''cluster-admin@soe.ucsc.edu''' for access. {| class="wikitable" style="text-align:center;" |- style="font-weight:bold;" ! Node Name ! Operating System<br /> ! CPU Cores ! GPUs/Type ! Memory ! Network Bandwidth ! 
Scratch Space |- | style="text-align:left;" | phoenix-00 | style="text-align:left;" | Ubuntu 22.04 | 256 | 8 / Nvidia A100 | 1 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[01-05] | style="text-align:left;" | Ubuntu 22.04 | 256 | 8 / Nvidia A5500 | 2 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[06-08] | style="text-align:left;" | Ubuntu 22.04 | 256 | N/A | 2 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[09-10] | style="text-align:left;" | Ubuntu 22.04 | 384 | N/A | 2.3 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[11-21] | style="text-align:left;" | Ubuntu 22.04 | 256 | N/A | 2 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[22-24] | style="text-align:left;" | Ubuntu 22.04 | 384 | N/A | 2.3 TB | 10 Gb/s | 16 TB NVMe |} The cluster head node is '''phoenix.prism'''. However, you cannot directly login to phoenix.prism in order to protect the scheduler from errant or runaway jobs there, so jobs can be submitted from any interactive compute server (mustard, emerald, razzmatazz or crimson). To learn more about how to use Slurm, refer to: https://giwiki.gi.ucsc.edu/index.php/Genomics_Institute_Computing_Information#Slurm_at_the_Genomics_Institute For scratch on the cluster, TMPDIR will be set to /data/tmp. That area is cleaned often so don't store any data there that isn't being used by your jobs. 1c1c418cf200966df761786b94ea9436a5a9fd2b 501 500 2024-06-10T18:22:52Z Weiler 3 /* The Phoenix Cluster */ wikitext text/x-wiki == Doing Work and Computing == When doing research, running jobs and the like, please be careful of your resource consumption on the server you are on. Don't run too many threads or cores at once if such a thing overruns the RAM available or the disk IO available. If you are not sure of your potential RAM, CPU or disk impact, start small with one or two processes and work your way up from there. 
Also, before running your stuff, check what else is already happening on the server by using the 'top' command to see who else and what else is running and what kind of resources are already being consumed. If, after starting a process, you realize that the server slows down considerably or becomes unusable, kill your processes and re-evaluate what you need to make things work. These servers are shared resources - be a good neighbor! == Server Types and Management == After confirming your VPN software is working, you can ssh into one of the shared compute servers behind the VPN. The DNS suffix for all machines is ".prism". So, "mustard" would have a full DNS name of "mustard.prism": {| class="wikitable" style="text-align:center;" |- style="font-weight:bold; text-align:left;" ! Node Name ! Operating System<br /> ! CPU Cores ! Memory ! Network Bandwidth ! Scratch Space |- | style="text-align:left;" | mustard | style="text-align:left;" | Ubuntu 22.04 | 160 | 1.5 TB | 10 Gb/s | 9 TB |- | style="text-align:left;" | emerald | style="text-align:left;" | Ubuntu 22.04 | 64 | 1 TB | 10 Gb/s | 690 GB |- | style="text-align:left;" | crimson | style="text-align:left;" | Ubuntu 22.04 | 32 | 256 GB | 10 Gb/s | 5.5 TB |- | style="text-align:left;" | razzmatazz | style="text-align:left;" | Ubuntu 22.04 | 32 | 256 GB | 10 Gb/s | 5.5 TB |} These ''shared'' servers are managed by the Genomics Institute Cluster Admin group. If you need software installed on any of these servers, please make your request by emailing cluster-admin@soe.ucsc.edu. == The Firewall == All servers are behind a firewall in this environment, and as such, you must connect to the VPN in order to access them. They will not be accessible from the greater Internet without VPN. Although you will be able to connect outbound from them to other servers on the internet to copy data in, sync git repos, stuff like that. It is only inbound connections that will be blocked. 
All machines behind the firewall have the private domain name suffix of "*.prism". == The Phoenix Cluster == This is a cluster of ~22 Ubuntu 22.04 nodes, some of which have GPUs in them. Each node generally has about 2TB RAM and 256 cores, although the cluster is heterogeneous and has multiple node types. You interact with the Phoenix Cluster via the Slurm Job Scheduler. You must specifically request access to use Slurm on the Phoenix Cluster, just email '''cluster-admin@soe.ucsc.edu''' for access. {| class="wikitable" style="text-align:center;" |- style="font-weight:bold;" ! Node Name ! Operating System<br /> ! CPU Cores ! GPUs/Type ! Memory ! Network Bandwidth ! Scratch Space |- | style="text-align:left;" | phoenix-00 | style="text-align:left;" | Ubuntu 22.04 | 256 | 8 / Nvidia A100 | 1 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[01-05] | style="text-align:left;" | Ubuntu 22.04 | 256 | 8 / Nvidia A5500 | 2 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[06-08] | style="text-align:left;" | Ubuntu 22.04 | 256 | N/A | 2 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[09-10] | style="text-align:left;" | Ubuntu 22.04 | 384 | N/A | 2.3 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[11-21] | style="text-align:left;" | Ubuntu 22.04 | 256 | N/A | 2 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[22-24] | style="text-align:left;" | Ubuntu 22.04 | 384 | N/A | 2.3 TB | 10 Gb/s | 16 TB NVMe |} The cluster head node is '''phoenix.prism'''. However, you cannot directly login to phoenix.prism in order to protect the scheduler from errant or runaway jobs there, so jobs can be submitted from any interactive compute server (mustard, emerald, razzmatazz or crimson). 
To learn more about how to use Slurm, refer to: https://giwiki.gi.ucsc.edu/index.php/Genomics_Institute_Computing_Information#Slurm_at_the_Genomics_Institute For scratch on the cluster, TMPDIR will be set to /data/tmp (which is local to each cluster node). That area is cleaned often so don't store any data there that isn't being used by your jobs. c33beea9778fc5df5aba7888010da0710b5fa899 502 501 2024-06-10T18:23:11Z Weiler 3 /* The Phoenix Cluster */ wikitext text/x-wiki == Doing Work and Computing == When doing research, running jobs and the like, please be careful of your resource consumption on the server you are on. Don't run too many threads or cores at once if such a thing overruns the RAM available or the disk IO available. If you are not sure of your potential RAM, CPU or disk impact, start small with one or two processes and work your way up from there. Also, before running your stuff, check what else is already happening on the server by using the 'top' command to see who else and what else is running and what kind of resources are already being consumed. If, after starting a process, you realize that the server slows down considerably or becomes unusable, kill your processes and re-evaluate what you need to make things work. These servers are shared resources - be a good neighbor! == Server Types and Management == After confirming your VPN software is working, you can ssh into one of the shared compute servers behind the VPN. The DNS suffix for all machines is ".prism". So, "mustard" would have a full DNS name of "mustard.prism": {| class="wikitable" style="text-align:center;" |- style="font-weight:bold; text-align:left;" ! Node Name ! Operating System<br /> ! CPU Cores ! Memory ! Network Bandwidth ! 
Scratch Space |- | style="text-align:left;" | mustard | style="text-align:left;" | Ubuntu 22.04 | 160 | 1.5 TB | 10 Gb/s | 9 TB |- | style="text-align:left;" | emerald | style="text-align:left;" | Ubuntu 22.04 | 64 | 1 TB | 10 Gb/s | 690 GB |- | style="text-align:left;" | crimson | style="text-align:left;" | Ubuntu 22.04 | 32 | 256 GB | 10 Gb/s | 5.5 TB |- | style="text-align:left;" | razzmatazz | style="text-align:left;" | Ubuntu 22.04 | 32 | 256 GB | 10 Gb/s | 5.5 TB |} These ''shared'' servers are managed by the Genomics Institute Cluster Admin group. If you need software installed on any of these servers, please make your request by emailing cluster-admin@soe.ucsc.edu. == The Firewall == All servers are behind a firewall in this environment, and as such, you must connect to the VPN in order to access them. They will not be accessible from the greater Internet without VPN. Although you will be able to connect outbound from them to other servers on the internet to copy data in, sync git repos, stuff like that. It is only inbound connections that will be blocked. All machines behind the firewall have the private domain name suffix of "*.prism". == The Phoenix Cluster == This is a cluster of 25 Ubuntu 22.04 nodes, some of which have GPUs in them. Each node generally has about 2TB RAM and 256 cores, although the cluster is heterogeneous and has multiple node types. You interact with the Phoenix Cluster via the Slurm Job Scheduler. You must specifically request access to use Slurm on the Phoenix Cluster, just email '''cluster-admin@soe.ucsc.edu''' for access. {| class="wikitable" style="text-align:center;" |- style="font-weight:bold;" ! Node Name ! Operating System<br /> ! CPU Cores ! GPUs/Type ! Memory ! Network Bandwidth ! 
Scratch Space |- | style="text-align:left;" | phoenix-00 | style="text-align:left;" | Ubuntu 22.04 | 256 | 8 / Nvidia A100 | 1 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[01-05] | style="text-align:left;" | Ubuntu 22.04 | 256 | 8 / Nvidia A5500 | 2 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[06-08] | style="text-align:left;" | Ubuntu 22.04 | 256 | N/A | 2 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[09-10] | style="text-align:left;" | Ubuntu 22.04 | 384 | N/A | 2.3 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[11-21] | style="text-align:left;" | Ubuntu 22.04 | 256 | N/A | 2 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[22-24] | style="text-align:left;" | Ubuntu 22.04 | 384 | N/A | 2.3 TB | 10 Gb/s | 16 TB NVMe |}

The cluster head node is '''phoenix.prism'''. However, you cannot log in to phoenix.prism directly - this protects the scheduler from errant or runaway jobs - so submit jobs from any interactive compute server (mustard, emerald, razzmatazz or crimson) instead. To learn more about how to use Slurm, refer to: https://giwiki.gi.ucsc.edu/index.php/Genomics_Institute_Computing_Information#Slurm_at_the_Genomics_Institute

For scratch on the cluster, TMPDIR will be set to /data/tmp (which is local to each cluster node). That area is cleaned often, so don't store any data there that isn't being used by your jobs.

= Firewalled User Account and Storage Cost =

== Account and Storage Cost ==

The cost of an active UNIX account is listed in this document: https://sites.google.com/view/ucscgenomicsinstitute/finance/recharge-services-rates?authuser=0 As of this writing, it looks like this: {| class="wikitable" |- style="font-weight:bold; text-align:center;" ! Service !
Cost |- | UNIX User Account per Month | style="text-align:center;" | $35.34 |- | OpenStack User Account per Month | style="text-align:center;" | $35.34 |- | TB of Storage per Month | style="text-align:center;" | Currently free, but may change one day |}

The sponsor of each user, and the owner of each /private/groups/labname area, provides a FOAPAL to our finance group to cover the monthly cost of these resources.

= Requirements for dbGaP Access =

If you need NIH dbGaP access, there are several requirements for gaining access - please complete all of these requirements '''BEFORE''' requesting dbGaP credentials. NOTE: If you already have GI VPN access to the GI "Prism" environment, then you have already completed the requirements detailed below - let Haifang Telc (haifang@ucsc.edu) know and we can quickly move to getting you set up. Please use this checklist to make sure that you have completed all '''three''' requirements:

'''1'''. Your PI's info and your PI's approval

'''2'''. NIH Public Security Refresher Course Certificate

'''3'''. Signed NIH Genomic Data Sharing Policy Agreement

'''1''': Ask your PI or sponsor to email '''cluster-admin@soe.ucsc.edu''' requesting dbGaP access for you. This email should include: your name, your PI's name, and the PI's approval for this access.

'''2''': You must take the NIH Public Security Refresher Course online, then print out the Completion Certificate (which should have your name on it) at the end of the training and deliver it to the GI Grants Team. You must complete the course in a single continuous sitting in order to be able to print the certificate at the end: https://irtsectraining.nih.gov/publicUser.aspx Click on the "2020 Information Security, Counterintelligence, Privacy Awareness, Records Management Refresher" link to begin the course.
At the end you will be able to print out the completion certificate, which should have your name on it.

'''3''': Please print and read the entire NIH Genomic Data Sharing Policy agreement (download link below), sign the last page, then scan and email the executed document to haifang@ucsc.edu with a subject line that includes: NIH GDS document. By signing the document you agree that you have read and understood the policies described therein and that you agree to abide by those policies: [[Media:NIH_GDS_Policy.pdf]]

= Overview of using Slurm =

When using Slurm, you will need to log into one of the interactive compute servers in the PRISM area (such as emerald, mustard, crimson or razzmatazz). Once you have ssh'd in, you can execute Slurm batch or interactive commands. You might also want to consult the [[Quick Reference Guide]].

== Submit a Slurm Batch Job ==

In order to submit a Slurm batch job, you will need to create a directory that you have both read and write access to on all the nodes (which will often be a shared space). Let's say I have a batch named "experiment-1". I would create that directory in my group's area:

 % mkdir -p /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1
 % cd /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1

Then you will need to create your job submission batch file. It will look something like this. My file is called 'slurm-test.sh':

 % vim slurm-test.sh

Then populate the file as necessary:

 #!/bin/bash
 # Job name:
 #SBATCH --job-name=weiler_test
 #
 # Partition - this is the queue it goes in:
 #SBATCH --partition=short
 #
 # Where to send email (optional):
 #SBATCH --mail-user=weiler@ucsc.edu
 #
 # Number of nodes you need per job:
 #SBATCH --nodes=1
 #
 # Memory needed for the job. Try very hard to make this accurate.
 # DEFAULT = 4gb
 #SBATCH --mem=4gb
 #
 # Number of tasks (one for each CPU desired for use case) (example):
 #SBATCH --ntasks=1
 #
 # Processors per task
 # (at least eight times the number of GPUs requested, for the nVidia RTX A5500):
 #SBATCH --cpus-per-task=1
 #
 # Number of GPUs; this can be in the format "--gres=gpu:[1-8]",
 # or "--gres=gpu:A5500:[1-8]" with the type included (optional):
 #SBATCH --gres=gpu:1
 #
 # Standard output and error log:
 #SBATCH --output=serial_test_%j.log
 #
 # Wall clock limit in hrs:min:sec:
 #SBATCH --time=00:00:30
 #
 ## Command(s) to run (example):
 pwd; hostname; date
 echo "Running test script on a single CPU core"
 sleep 5
 echo "Test done!"
 date

Keep the "SBATCH" lines commented; the scheduler will read them anyway. If you don't need a particular option, just leave it out of the file. To submit the batch job:

 % sbatch slurm-test.sh
 Submitted batch job 7

The job(s) will then be scheduled. You can see the state of the queue as such:

 % squeue
 JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
     7     batch weiler_t weiler R 0:07   1 phoenix-01

The job will write any STDOUT or STDERR to the directory you launched it from. Other than that, it will do whatever the job does, even if there is no STDOUT.

== Launching Several Jobs at Once ==

You can launch many jobs at once using the $SLURM_ARRAY_TASK_ID variable. Add something like the following to your batch submission file:

 #SBATCH --array=0-31
 #SBATCH --output=array_job_%A_task_%a.out
 #SBATCH --error=array_job_%A_task_%a.err
 ## Command(s) to run:
 echo "I am task $SLURM_ARRAY_TASK_ID"

== CGROUPS and Resource Management ==

Our installation of Slurm uses Linux cgroups, which put a hard resource cap on jobs. If you declare that your job needs 4GB of RAM and it uses 5GB, it will fail with an OOM exception. Likewise with CPU or GPU resources; if your job ends up using more than you specify, it will fail.
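Since a job that overruns its request is killed rather than throttled, it pays to measure a test run and pad the numbers before writing them into a batch file (the partitions page on this wiki suggests roughly 20% extra RAM and 40% extra walltime). A small sketch of that arithmetic - the helper names here are made up for illustration:

```shell
# Hypothetical helpers: pad a measured figure before requesting it.
# pad_mem adds ~20% to a memory figure (MB); pad_time adds ~40% to seconds.
# Integer math, rounded up, since sbatch takes whole numbers.
pad_mem()  { echo $(( ($1 * 120 + 99) / 100 )); }
pad_time() { echo $(( ($1 * 140 + 99) / 100 )); }

pad_mem 100   # a job that peaked at 100 MB -> request 120 MB
pad_time 60   # a 60-second test run -> request 84 seconds
```

Feed the padded numbers into '''--mem''' and '''--time'''; the exact percentages are a judgment call, not site policy.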
The same goes for the "--time" batch file option: your job will fail if it runs longer than the limit you specify. This is to keep the nodes from crashing under runaway jobs that use more resources than you think they will. So... TEST YOUR JOBS! Find out how many resources a single job needs before you launch 100 of them.

== TEST YOUR JOBS! ==

Let me say that one more time: test your jobs before launching a bunch of them! If a job fails, you don't want it to fail 100 or more times. Testing also gives you a good idea of how much RAM and CPU each job will need, so you can better define your batch files. It's critical to get a good idea of how many resources each of your jobs will use and define your job file appropriately.

= GPU Resources =

When submitting jobs, you can ask for GPUs in one of two ways. One is:

 #SBATCH --partition=gpu
 #SBATCH --gres=gpu:1

That will ask for 1 GPU, generically, on a node with a free GPU. This request is more specific:

 #SBATCH --partition=gpu
 #SBATCH --gres=gpu:A5500:3

That requests 3 A5500 GPUs '''only'''.
We have several GPU types on the cluster which may fit your specific needs:

 nVidia RTX A5500 : 24GB RAM
 nVidia A100 : 80GB RAM

For the most part, Slurm takes care of making sure that each job only sees and uses the GPUs assigned to it. Within the job, '''CUDA_VISIBLE_DEVICES''' will be set in the environment, but it will always be set to a list of your requested number of GPUs, starting at 0: Slurm re-numbers the GPUs assigned to each job to appear to start at 0 within the job. If you need access to the "real" GPU numbers (to log, or to pass along to Docker), they are available in the '''SLURM_JOB_GPUS''' (for '''sbatch''') or '''SLURM_STEP_GPUS''' (for '''srun''') environment variable.

==Running GPU Workloads==

To actually use an nVidia GPU, you need to run a program that uses the CUDA API. There are a few ways to obtain such a program.

===Prebuilt CUDA Applications===

The Slurm cluster nodes have the nVidia drivers installed, as well as basic CUDA tools like nvidia-smi. Some projects, such as TensorFlow, may ship pre-built binaries that can use CUDA. You should be able to run these binaries directly if you download them.

===Building CUDA Applications===

The cluster nodes do not have the full CUDA Toolkit. In particular, they do not have the '''nvcc''' CUDA-enabled compiler. If you want to compile applications that use CUDA, you will need to install the development environment yourself, for your user. Once you have '''nvcc''' available to your user, building CUDA applications should work. To run them, you will have to submit them as jobs, because the head node does not have a GPU.

===Containerized GPU Workloads===

Instead of directly installing binaries, or installing and using the CUDA Toolkit, it is often easiest to use containers to download a prebuilt GPU workload and all of its libraries and dependencies. There are a few options for running containerized GPU workloads on the cluster.
====Running Containers in Singularity====

You can run containers on the cluster using Singularity, and give them access to the GPUs that Slurm has selected using the '''--nv''' option. For example:

 singularity pull docker://tensorflow/tensorflow:latest-gpu
 srun -c 8 --mem 10G --partition=gpu --time=00:20:00 --gres=gpu:1 singularity run --nv docker://tensorflow/tensorflow:latest-gpu python -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())'

This will produce output showing that the TensorFlow container is indeed able to talk to one GPU:

 INFO: Using cached SIF image
 2023-05-15 11:36:33.110850: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
 2023-05-15 11:36:38.799035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /device:GPU:0 with 22244 MB memory: -> device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6
 [name: "/device:CPU:0" device_type: "CPU" memory_limit: 268435456 locality { } incarnation: 8527638019084870106 xla_global_id: -1 , name: "/device:GPU:0" device_type: "GPU" memory_limit: 23324655616 locality { bus_id: 1 links { } } incarnation: 1860154623440434360 physical_device_desc: "device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6" xla_global_id: 416903419 ]

Slurm's confinement of the job to the correct set of GPUs is also passed through to the Singularity container; there is no need to specifically direct Singularity to use the right GPUs unless you are doing something unusual.

====Running Containers in Slurm====

Slurm itself also supports a '''--container''' option for jobs, which allows a whole job to be run inside a container.
If you are able to [https://slurm.schedmd.com/containers.html convert your container to OCI Bundle format], you can pass it directly to Slurm instead of using Singularity from inside the job. However, Docker-compatible image specifiers can't be given to Slurm, only paths to OCI bundles on disk. Stand-alone tools to download a Docker image from Docker Hub in OCI bundle format ('''skopeo''' and '''umoci''') are not yet installed on the cluster, but the method using the '''docker''' command should work. Slurm containers ''should'' have access to their assigned GPUs, but it is not clear whether tools like '''nvidia-smi''' are injected into the container, as they would be with Singularity or the nVidia Container Runtime.

====Running Containers in Docker====

You might be used to running containers with Docker, or containerized GPU workloads with the nVidia Container Runtime or Toolkit. Docker is installed on all the nodes and the daemon is running; if the '''docker''' command does not work for you, ask cluster-admin to add you to the right groups. The '''nvidia''' runtime is set up and will automatically be used. However, while Slurm configures each Slurm job with a cgroup that directs it to the correct GPUs, '''using Docker to run another container escapes Slurm's confinement''', and using '''--gpus=1''' will ''always'' use the ''first'' GPU in the system, whether that GPU is assigned to your job or not. When using Docker, you ''must'' consult the '''SLURM_JOB_GPUS''' (for '''sbatch''') or '''SLURM_STEP_GPUS''' (for '''srun''') environment variable and pass that along to your container. You should also impose limits on all other resources used by your Docker container, so that your whole job stays within the resources allocated by Slurm's scheduler. (TODO: find out how cgroups handles oversubscription between a Docker container and the Slurm container that launched it).
An example of a working command is:

 srun -c 1 --mem 4G --partition=gpu --time=00:20:00 --gres=gpu:2 bash -c 'docker run --rm --gpus=\"device=$SLURM_STEP_GPUS\" nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi'

Note that the double-quotes are included in the argument to '''--gpus''' as seen by the Docker client, and that '''bash''' and single-quotes are used to ensure that '''$SLURM_STEP_GPUS''' is evaluated within the job itself, and not on the head node.

= Slurm Queues (Partitions) and Resource Management =

== Partitions ==

Due to heterogeneous workloads and different batch requirements, we have implemented partitions in Slurm, which are similar to queues. Each partition has different default and maximum walltime (aka "runtime") limits. You will need to select a partition to launch your jobs in based on what kind of jobs they are and how long they are expected to run. {| class="wikitable" |- style="font-weight:bold;" ! Partition Name ! Default Walltime Limit ! Maximum Walltime Limit ! style="border-color:inherit;" | Default Partition? ! Job Priority ! Maximum Nodes Utilized |- | short | 10 minutes | 1 hour | style="border-color:inherit;" | Yes | Normal | All |- | medium | 1 hour | 12 hours | style="border-color:inherit;" | No | Normal | 15 |- | long | 12 hours | 14 days | style="border-color:inherit;" | No | Normal | 10 |- | high_priority | 10 minutes | 7 days | style="border-color:inherit;" | No | High | All<br /> |- | gpu | 10 minutes | 7 days | No | Normal | 6 |}

If you do not specify a partition to run your job in (with e.g. <code>--partition=medium</code>), it will automatically be assigned the "short" partition by default. If you do not specify a walltime value in your job submission script (with e.g. <code>--time=00:30:00</code>), it will inherit the "Default Walltime Limit" of the partition it is assigned.
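Given those defaults, it is worth stating both choices explicitly in every batch file rather than inheriting them. A sketch with arbitrary example values:

```shell
# Pick the partition whose walltime window fits the job, and state how
# long the job actually needs instead of inheriting the partition default.
#SBATCH --partition=medium
#SBATCH --time=02:00:00
```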
Therefore, it is a very good idea to specify which partition your job will go in, and you should also specify a walltime limit; otherwise your jobs will inherit the default walltime limit in the chart above. All of this means it is very important to '''TEST''' your jobs before running many of them! Submit one job and note how many resources it takes (RAM, CPU) and how long it takes to run. Then, when you submit many of those jobs, you can correctly specify the number of CPU cores each job needs, how much RAM it needs (pad it by about 20% just in case), and how much time it needs to run (pad it by 40% to account for environmental factors like disk IO load and CPU context-switching load). You can test your jobs by running one job via '''srun''' with fairly high CPU, RAM and walltime limits (just so it isn't killed due to default limits), then noting how many resources it consumed while running (after it finishes).

'''Example'''

 seff 769059

'''Output'''

 Job ID: 769059
 Cluster: phoenix
 User/Group: <user-name>/<group-name>
 State: COMPLETED (exit code 0)
 Nodes: 1
 Cores per node: 16
 CPU Utilized: 00:00:01
 CPU Efficiency: 0.11% of 00:15:28 core-walltime
 Job Wall-clock time: 00:00:58
 Memory Utilized: 4.79 MB
 Memory Efficiency: 4.79% of 100.00 MB

So if I needed to run, say, 1000 of these jobs, and they were all similar, I would select the "short" partition, 1 CPU core, maybe 8MB of RAM, and maybe a 90-second walltime limit. Note how I padded the RAM and walltime a bit to account for unexpectedly variable cluster conditions.

== '''high_priority''' Partition Notes ==

The "high_priority" partition is special in that its jobs have the highest priority of all jobs on the cluster and will push all other jobs aside in an effort to finish as fast as possible. It is only available for emergency or mission-critical batches that need to be completed unexpectedly fast.
Access to this partition is only granted on a per-request basis, and is temporary until your batch finishes. Email '''cluster-admin@soe.ucsc.edu''' if you need access to the high_priority queue, and make your case for why it is necessary.

== My job is not running but I want it to be running ==

Even if your job is in the high_priority partition, that doesn't mean that the cluster will drop everything and run it immediately. Because we don't have pre-emption set up, high-priority jobs still have to wait for currently-running jobs to finish, as well as for other high-priority jobs. And since, as noted above, jobs can be allowed to run for up to 7 days each, it is physically possible for even the highest-priority job in the whole cluster to not start for a whole week. Here is a [https://docs-research-it.berkeley.edu/services/high-performance-computing/user-guide/running-your-jobs/why-job-not-run/ good resource from Berkeley] about understanding and debugging Slurm job scheduling. Basically, Slurm uses the wall-clock limits of running jobs, and of jobs in the queue, to make a plan to start each job on some node at some time in the future. If jobs finish early, other jobs can start sooner than scheduled, and if there is space around higher-priority jobs, lower-priority jobs can be filled in. If you want to know when Slurm plans to run your job, and why that is not right now, you can use the <code>--start</code> option for the <code>squeue</code> command:

 $ squeue -j 1719584 --start
 JOBID PARTITION NAME USER ST START_TIME NODES SCHEDNODES NODELIST(REASON)
 1719584 short snakemak flastnam PD 2024-01-22T10:20:00 1 phoenix-00 (Priority)

The <code>START_TIME</code> column is the time by which Slurm is sure it will be able to start your job if no higher-priority jobs come in first, and the <code>NODELIST(REASON)</code> column shows the nodes the job is running on, or the reason it is not running now, in parentheses.
In this case, the job is not running because higher-priority jobs are in the way.

= Phoenix WDL Tutorial =

'''Tutorial: Getting Started with WDL Workflows on Phoenix'''

Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like the Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster; the other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.

=Phoenix Cluster Setup=

Before we begin, you will need a computer to work from that you are able to install software on, and the ability to connect to other machines over SSH.

==Getting VPN access==

We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ==Connecting to Phoenix== Once you have VPN access, you can connect to any of the machines with access to the Phoenix cluster. These interactive nodes are fairly large machines that can do some work locally, but you will still want to run larger workflows on the actual cluster. For this tutorial, we will use <code>emerald.prism</code> as our login node. To connect to the cluster: 1. Connect to the VPN. 2. SSH to <code>emerald.prism</code>. At the command line, run: ssh emerald.prism If your username on the cluster (say, <code>flastname</code>) is different than your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@emerald.prism The first time you connect, you will see a message like: The authenticity of host 'emerald.prism (10.50.1.67)' can't be established. ED25519 key fingerprint is SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI. This key is not known by any other names. Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>emerald.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is.
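If you connect often, you can save some typing by giving the login node an alias in your SSH client configuration. This is an optional convenience, not part of the required setup; <code>flastname</code> below is a placeholder for your actual cluster username.

```shell
# Sketch: print an ~/.ssh/config stanza for the emerald login node.
# "flastname" is a placeholder username; substitute your own before using it.
make_ssh_stanza() {
  cat <<'EOF'
Host emerald
    HostName emerald.prism
    User flastname
EOF
}

# Print the stanza so you can review it before appending it to ~/.ssh/config:
make_ssh_stanza
```

With that stanza appended to <code>~/.ssh/config</code>, a plain <code>ssh emerald</code> behaves like <code>ssh flastname@emerald.prism</code>.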
==Installing Toil with WDL support== Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run: pip install --upgrade --user 'toil[wdl]' If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively: pip install --upgrade --user 'toil[wdl,aws,google]' This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter ''will not'' look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run: echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc After that, '''log out and log back in''', to restart bash and pick up the change. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above. ==Configuring your Phoenix Environment== '''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under <code>/private/groups</code>. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under <code>/private/groups</code>. Usually you would end up with <code>/private/groups/YOURGROUPNAME/YOURUSERNAME</code>.
Remember this path; we will need it later. ==Configuring Toil for Phoenix== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. However, since these files can be large, and the home directory quota is only 30 GB, we can't keep these in your home directory. We will need to use the <code>/private/groups/YOURGROUPNAME/YOURUSERNAME</code> directory you created earlier. Make sure that that directory is available in your <code>~/.bashrc</code> file by editing and running this command: echo 'BIG_DATA_DIR=/private/groups/YOURGROUPNAME/YOURUSERNAME' >>~/.bashrc Then use these commands to make sure that Toil knows where it ought to put its caches: echo 'export SINGULARITY_CACHEDIR="${BIG_DATA_DIR}/.singularity/cache"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${BIG_DATA_DIR}/.cache/miniwdl"' >>~/.bashrc After that, '''log out and log back in again''', to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day]. '''If you get errors about mutexes, lock files, or other weird problems with Singularity''', try moving these directories to inside <code>/data/tmp</code> on the individual nodes, or unsetting them and letting Toil use its defaults (and exhaust our Docker pull limits). [https://github.com/DataBiosphere/toil/issues/4654 It is not clear that <code>/private/groups</code> actually implements the necessary file locking correctly.] 
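Because a missing or misplaced cache directory is easy to overlook, here is a small shell sketch (a convenience of this tutorial, not an official Toil tool) that checks that the two cache variables from this section are set, and creates the directories if they do not exist yet.

```shell
# Sketch: verify the Singularity/MiniWDL cache variables described above.
# Assumes you already export SINGULARITY_CACHEDIR and
# MINIWDL__SINGULARITY__IMAGE_CACHE in your ~/.bashrc as shown.
check_caches() {
  status=0
  for var in SINGULARITY_CACHEDIR MINIWDL__SINGULARITY__IMAGE_CACHE; do
    eval "dir=\${$var:-}"          # indirect lookup of the variable's value
    if [ -z "$dir" ]; then
      echo "MISSING: $var is not set"
      status=1
    else
      mkdir -p "$dir"              # create the cache directory if absent
      echo "OK: $var -> $dir"
    fi
  done
  return $status
}
```

Run <code>check_caches</code> after logging back in; if it reports <code>MISSING</code>, the <code>~/.bashrc</code> edits above did not take effect.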
=Running an existing workflow= First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, go to your user directory under <code>/private/groups</code>, and make a directory to work in. cd /private/groups/YOURGROUPNAME/YOURUSERNAME mkdir workflow-test cd workflow-test Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ==Preparing an input file== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code> echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. 
Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ==Testing at small scale on a single machine== We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ==Running at larger scale== Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool, let's greet one hundred people!
Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the <code>--caching false</code> option. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the <code>TOIL_SLURM_ARGS</code> environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in. mkdir -p logs export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium" toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. =Writing your own workflow= In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. 
We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ==Writing the file== ===Version=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 ===Workflow Block=== Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } ===Input Block=== Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run. ===Body=== Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. 
Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. ===Scattering=== Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } ===Conditionals=== Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. 
Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } ===Calling Tasks=== Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access; the task's output will only exist for the numbers where it actually ran, rather than producing one of the noises. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } ===Writing Tasks=== Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill that in, in <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command.
We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We will also tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } ===Output Block=== Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1 and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. ==Running the Workflow== Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium" toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json =Debugging Workflows= Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.
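One cheap sanity check for the tutorial's FizzBuzz workflow is to compare its output against the same logic written in plain shell. This sketch mirrors the workflow's defaults (<code>to_fizz = 3</code>, <code>to_buzz = 5</code>, no override); it is a reference for eyeballing <code>fizzbuzz_out.json</code>, not part of the workflow itself.

```shell
# Sketch: FizzBuzz in plain shell, matching the WDL workflow's default
# behavior, so you can check what fizzbuzz_results should contain.
fizzbuzz() {
  item_count=$1
  i=1
  while [ "$i" -le "$item_count" ]; do
    if [ $((i % 15)) -eq 0 ]; then echo "FizzBuzz"   # multiple of both 3 and 5
    elif [ $((i % 3)) -eq 0 ]; then echo "Fizz"
    elif [ $((i % 5)) -eq 0 ]; then echo "Buzz"
    else echo "$i"                                   # just a normal number
    fi
    i=$((i + 1))
  done
}

fizzbuzz 20
```

If the workflow's <code>fizzbuzz_results</code> array disagrees with this list for the same <code>item_count</code>, something in the WDL conditionals is wrong.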
==Restarting the Workflow== If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with <code>--jobStore</code> set manually to a directory that persists between attempts, you can add <code>--restart</code> to your workflow command and make Toil try again. It will pick up from where it left off and rerun any failed tasks and then the rest of the workflow. This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart. If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques. ==Debugging Options== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place where you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this. ==Reading the Log== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen when either the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error.
Go up higher in the log until you find lines that look like: [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows: And [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows: These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. ==Reproducing Problems== When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. In the log of your failing Toil task, look for lines like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk.
So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ==More Ways of Finding Files== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name. For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ==Using Development Versions of Toil== Sometimes, bugs will be fixed in the development version of Toil, but not released yet. 
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ==Frequently Asked Questions== ===I am getting warnings about <code>XDG_RUNTIME_DIR</code>=== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it. ===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!=== The Toil worker process for each job will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. 
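===How do I decode a <code>toilfile:</code> URI without a web decoder?=== The <code>toilfile:</code> URIs discussed under "More Ways of Finding Files" can also be decoded at the command line. This is a convenience sketch of this tutorial, not a Toil feature; it only handles the <code>%3A</code> and <code>%2F</code> escapes that appear in these URIs, so for anything else use a general URL decoder.

```shell
# Sketch: URL-decode a toilfile: URI from a Toil debug log and print the
# job-store-relative path after the last colon. Only the %3A (':') and
# %2F ('/') escapes seen in these URIs are handled here.
decode_toilfile() {
  uri=$1
  decoded=$(printf '%s\n' "$uri" | sed -e 's/%3A/:/g' -e 's|%2F|/|g')
  echo "${decoded##*:}"   # keep everything after the last ':'
}

# The example URI from the log excerpt above:
decode_toilfile 'toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam'
```

The printed path, joined onto your <code>--jobStore</code> directory, is where the file lives on disk.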
=Additional WDL resources= For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL dcoumentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification] 35434cff814f177e1af3a0d81debb3b25782c104 506 505 2024-07-16T20:16:39Z Anovak 4 /* Reproducing Problems */ Explain new debug-job features wikitext text/x-wiki '''Tutorial: Getting Started with WDL Workflows on Phoenix''' Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong. =Phoenix Cluster Setup= Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH. 
==Getting VPN access==

We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need access to the VPN (Virtual Private Network) system that we use to let people through the firewall.

To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.

==Connecting to Phoenix==

Once you have VPN access, you can connect to any of the machines with access to the Phoenix cluster. These interactive nodes are fairly large machines that can do some work locally, but you will still want to run larger workflows on the actual cluster. For this tutorial, we will use <code>emerald.prism</code> as our login node.

To connect to the cluster:

# Connect to the VPN.
# SSH to <code>emerald.prism</code>.

At the command line, run:

 ssh emerald.prism

If your username on the cluster (say, <code>flastname</code>) is different from your username on your computer (which might be <code>firstname</code>), you might instead have to run:

 ssh flastname@emerald.prism

The first time you connect, you will see a message like:

 The authenticity of host 'emerald.prism (10.50.1.67)' can't be established.
 ED25519 key fingerprint is SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI.
 This key is not known by any other names.
 Are you sure you want to continue connecting (yes/no/[fingerprint])?

This is your computer asking you to help it decide if it is talking to the genuine <code>emerald.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI</code>.
If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is.

==Installing Toil with WDL support==

Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run:

 pip install --upgrade --user 'toil[wdl]'

If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to install Toil with the <code>aws</code> and <code>google</code> extras, respectively:

 pip install --upgrade --user 'toil[wdl,aws,google]'

This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter '''will not''' look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:

 echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc

After that, '''log out and log back in''' to restart bash and pick up the change. To make sure it worked, you can run:

 toil-wdl-runner --help

If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above.

==Configuring your Phoenix Environment==

'''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs.
Any data worth running a workflow on should be in a directory under <code>/private/groups</code>. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under <code>/private/groups</code>. Usually you would end up with <code>/private/groups/YOURGROUPNAME/YOURUSERNAME</code>. Remember this path; we will need it later.

==Configuring Toil for Phoenix==

Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster we do have a shared filesystem, so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these files can be large, and the home directory quota is only 30 GB, we can't keep them in your home directory. We will need to use the <code>/private/groups/YOURGROUPNAME/YOURUSERNAME</code> directory you created earlier.

Make the path to that directory available in your <code>~/.bashrc</code> file by editing this command to match and then running it:

 echo 'BIG_DATA_DIR=/private/groups/YOURGROUPNAME/YOURUSERNAME' >>~/.bashrc

Then use these commands to make sure that Toil knows where it ought to put its caches:

 echo 'export SINGULARITY_CACHEDIR="${BIG_DATA_DIR}/.singularity/cache"' >>~/.bashrc
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${BIG_DATA_DIR}/.cache/miniwdl"' >>~/.bashrc

After that, '''log out and log back in again''' to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day].
'''If you get errors about mutexes, lock files, or other weird problems with Singularity''', try moving these directories to inside <code>/data/tmp</code> on the individual nodes, or unsetting them and letting Toil use its defaults (and exhaust our Docker pull limits). [https://github.com/DataBiosphere/toil/issues/4654 It is not clear that <code>/private/groups</code> actually implements the necessary file locking correctly.]

=Running an existing workflow=

First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].

First, go to your user directory under <code>/private/groups</code>, and make a directory to work in:

 cd /private/groups/YOURGROUPNAME/YOURUSERNAME
 mkdir workflow-test
 cd workflow-test

Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.

 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl

==Preparing an input file==

Near the top of the WDL file, there's a section like this:

 workflow hello_caller {
     input {
         File who
     }

This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.

So first, we have to make that list of names. Let's make it in <code>names.txt</code>:

 echo "Mridula Resurrección" >names.txt
 echo "Gershom Šarlota" >>names.txt
 echo "Ritchie Ravi" >>names.txt

Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow.
(You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this:

 echo '{"hello_caller.who": "./names.txt"}' >inputs.json

Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].

==Testing at small scale on a single machine==

We are now ready to run the workflow! You don't want to run workflows on the head node, so use Slurm to get an interactive session on one of the cluster's worker nodes by running:

 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i

This will start a new shell that can run for 2 hours; to leave it and go back to the head node, you can use <code>exit</code>. In your new shell, run this Toil command:

 toil-wdl-runner self_test.wdl inputs.json -o local_run

This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print:

 {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]}

The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person.
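Those escape sequences are ordinary JSON Unicode escapes, so any JSON parser will give you back the real characters. A minimal Python sketch, using one of the strings from the output above:

```python
import json

# One of the strings printed by toil-wdl-runner, with its Unicode escape intact.
raw = '{"greeting": "Hello, Mridula Resurrecci\\u00f3n!"}'

# Parsing the JSON turns the \u00f3 escape back into the real character.
decoded = json.loads(raw)
print(decoded["greeting"])  # Hello, Mridula Resurrección!
```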
To leave your interactive Slurm session and return to the head node, use <code>exit</code>.

==Running at larger scale==

Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people! Go get this handy list of people and cut it to length:

 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt
 head -n100 1000_names.txt >100_names.txt

And make a new inputs file:

 echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json

Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create, where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the <code>--caching false</code> option. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the <code>TOIL_SLURM_ARGS</code> environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.

 mkdir -p logs
 export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium"
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json

This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory.
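The file written by <code>-m</code> holds the same JSON that the small run printed to standard output, so you can inspect the results programmatically. A sketch, assuming the keys match the small run's output shown earlier (the file contents here are an illustrative stand-in; a real <code>slurm_run.json</code> would list 100 entries):

```python
import json

# Illustrative stand-in for the contents of slurm_run.json, shaped like the
# output shown for the small run.
raw = ('{"hello_caller.message_files": ["slurm_run/Ritchie Ravi.txt"], '
       '"hello_caller.messages": ["Hello, Ritchie Ravi!"]}')

outputs = json.loads(raw)
for path, message in zip(outputs["hello_caller.message_files"],
                         outputs["hello_caller.messages"]):
    print(path, "->", message)  # slurm_run/Ritchie Ravi.txt -> Hello, Ritchie Ravi!
```

For a real run you would replace <code>raw</code> with the contents of the <code>-m</code> file, for example via <code>json.load(open("slurm_run.json"))</code>.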
=Writing your own workflow=

In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz].

==Writing the file==

===Version===

All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement:

 version 1.0

===Workflow Block===

Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>.

 version 1.0
 workflow FizzBuzz {
 }

===Input Block===

Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section.

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
 }

Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.

===Body===

Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable.
 Array[Int] numbers = range(item_count)

WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.

===Scattering===

Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0.

 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
 }

===Conditionals===

Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we make an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array.

Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else, you will have to check the negated condition.

So first, let's handle the special cases.
 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
     if (one_based % to_fizz == 0) {
         String fizz = "Fizz"
         if (one_based % to_buzz == 0) {
             String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
         }
     }
     if (one_based % to_buzz == 0) {
         String buzz = "Buzz"
     }
     if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
         # Just a normal number.
     }
 }

===Calling Tasks===

Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later.

To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, but only when the call actually ran (that is, when we didn't produce a Fizz, Buzz, or FizzBuzz instead).

 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
     if (one_based % to_fizz == 0) {
         String fizz = "Fizz"
         if (one_based % to_buzz == 0) {
             String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
         }
     }
     if (one_based % to_buzz == 0) {
         String buzz = "Buzz"
     }
     if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
         # Just a normal number.
         call stringify_number {
             input:
                 the_number = one_based
         }
     }
     String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
 }

We can put the code into the workflow now, and set about writing the task.

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }

===Writing Tasks===

Our task should go after the workflow in the file. It looks a lot like a workflow, except it uses <code>task</code>.

 task stringify_number {
 }

We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill that in in <code>input</code> and <code>output</code> sections.

 task stringify_number {
     input {
         Int the_number
     }
     # ???
     output {
         String the_string # = ???
     }
 }

Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code> (Bash-like substitution, but with a tilde) to place WDL variables into your command script. So let's add a command that will echo back the number, so we can see it as a string.

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string # = ???
     }
 }

Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command.
We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
 }

We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage.

Then we can put our task into our WDL file:

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }
 
 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

===Output Block===

Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task.

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
     output {
         Array[String] fizzbuzz_results = result
     }
 }
 
 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array.

==Running the Workflow==

Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:

 echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json

Then run it on the cluster with Toil:

 mkdir -p logs
 export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium"
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

Or locally:

 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

=Debugging Workflows=

Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.
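One low-tech first idea, for a workflow as small as FizzBuzz: mirror its logic in ordinary code and compare answers. Here is a hedged Python sketch of the scatter body from the workflow above, modeling WDL's <code>null</code> (from un-executed conditionals) as <code>None</code> and <code>select_first()</code> as picking the first non-<code>None</code> value:

```python
def fizzbuzz(item_count, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    """Plain-Python mirror of the FizzBuzz WDL workflow's scatter body."""
    results = []
    for i in range(item_count):
        one_based = i + 1
        # Variables declared in un-executed WDL conditionals are null;
        # model them here as None.
        fizz = "Fizz" if one_based % to_fizz == 0 else None
        buzz = "Buzz" if one_based % to_buzz == 0 else None
        fb = None
        if fizz is not None and buzz is not None:
            # select_first([fizzbuzz_override, "FizzBuzz"])
            fb = fizzbuzz_override if fizzbuzz_override is not None else "FizzBuzz"
        # The stringify_number call only runs for plain numbers.
        stringified = str(one_based) if fizz is None and buzz is None else None
        # select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
        results.append(next(v for v in (fb, fizz, buzz, stringified) if v is not None))
    return results

print(fizzbuzz(15))
# ['1', '2', 'Fizz', '4', 'Buzz', 'Fizz', '7', '8', 'Fizz', 'Buzz', '11', 'Fizz', '13', '14', 'FizzBuzz']
```

If the workflow's <code>fizzbuzz_results</code> output ever disagrees with this model, you know which side to suspect.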
==Restarting the Workflow==

If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with <code>--jobStore</code> set manually to a directory that persists between attempts, you can add <code>--restart</code> to your workflow command and make Toil try again. It will pick up from where it left off, rerun any failed tasks, and then run the rest of the workflow. This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart. If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques.

==Debugging Options==

When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code>, so that the stored files shipped between jobs are in a place where you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:

 =========>
 Toil job log is here
 <=========

Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this.

==Reading the Log==

When a WDL workflow fails, you are likely to see a message like this:

 WDL.runtime.error.CommandFailed: task command failed with exit status 1
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism

This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen either when the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error.
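An exit status is just the integer a command finishes with, and anything nonzero counts as failure. You can reproduce the same ingredients outside Toil; a small Python sketch using the standard <code>subprocess</code> module (the failing command here is purely illustrative):

```python
import subprocess

# Run a command that writes an error message and exits nonzero,
# the way a broken task command would.
result = subprocess.run(["sh", "-c", "echo 'tool error' >&2; exit 3"],
                        capture_output=True, text=True)

print(result.returncode)      # 3: nonzero, so a WDL runner would treat the task as failed
print(result.stderr.strip())  # tool error
```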
Go up higher in the log until you find lines that look like:

 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows:

and

 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows:

These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.

==Reproducing Problems==

When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this.

===Automatically Fetching Input Files===

The <code>toil debug-job</code> command has a <code>--retrieveTaskDirectory</code> option that lets you dump out a directory with all the files that a failing WDL task would use. You can use it like:

 toil debug-job ./jobstore WDLTaskJob --retrieveTaskDirectory dumpdir

If there are multiple failing tasks, you might need to replace <code>WDLTaskJob</code> with the name of one of the failing jobs. See [https://toil.readthedocs.io/en/latest/running/debugging.html#fetching-job-inputs the Toil documentation on retrieving files] for more on how to use this command.

===Manually Finding Input Files===

If you can't use <code>toil debug-job</code>, you might need to manually dig through the job store for files.
In the log of your failing Toil task, look for lines like this:

 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'
 ...

The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk. So, if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at:

 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam

==More Ways of Finding Files==

Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try to find the files by name.
For example, if you want to look at <code>Sample.bam</code>, you can look for it like this:

 find /path/to/the/jobstore -name "Sample.bam"

If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:

 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam

You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:

 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam

Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found.

==Using Development Versions of Toil==

Sometimes, bugs will be fixed in the development version of Toil but not released yet.
To try the current development version of Toil, you can install it like this:

 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'

If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do:

 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'

==Frequently Asked Questions==

===I am getting warnings about <code>XDG_RUNTIME_DIR</code>===

You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> This happens because Slurm is providing Toil with an <code>XDG_RUNTIME_DIR</code> environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it.

===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!===

The Toil worker process for each job will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>; when running in single machine mode, these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.
'''Tutorial: Getting Started with WDL Workflows on Phoenix''' Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like the Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong. =Phoenix Cluster Setup= Before we begin, you will need a computer to work from on which you can install software, and the ability to connect to other machines over SSH. ==Getting VPN access== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives.
So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ==Connecting to Phoenix== Once you have VPN access, you can connect to any of the machines with access to the Phoenix cluster. These interactive nodes are fairly large machines that can do some work locally, but you will still want to run larger workflows on the actual cluster. For this tutorial, we will use <code>emerald.prism</code> as our login node. To connect to the cluster: 1. Connect to the VPN. 2. SSH to <code>emerald.prism</code>. At the command line, run: ssh emerald.prism If your username on the cluster (say, <code>flastname</code>) is different from your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@emerald.prism The first time you connect, you will see a message like: The authenticity of host 'emerald.prism (10.50.1.67)' can't be established. ED25519 key fingerprint is SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI. This key is not known by any other names. Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>emerald.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen.
If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is. ==Installing Toil with WDL support== Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run: pip install --upgrade --user 'toil[wdl]' If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively: pip install --upgrade --user 'toil[wdl,aws,google]' This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter ''will not'' look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run: echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc After that, '''log out and log back in''', to restart bash and pick up the change. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above. ==Configuring your Phoenix Environment== '''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under <code>/private/groups</code>. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under <code>/private/groups</code>.
Usually you would end up with <code>/private/groups/YOURGROUPNAME/YOURUSERNAME</code>. Remember this path; we will need it later. ==Configuring Toil for Phoenix== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these files can be large, and the home directory quota is only 30 GB, we can't keep them in your home directory. We will need to use the <code>/private/groups/YOURGROUPNAME/YOURUSERNAME</code> directory you created earlier. Record that directory in your <code>~/.bashrc</code> file by editing this command to use your actual group and user names, and then running it: echo 'BIG_DATA_DIR=/private/groups/YOURGROUPNAME/YOURUSERNAME' >>~/.bashrc Then use these commands to make sure that Toil knows where it ought to put its caches: echo 'export SINGULARITY_CACHEDIR="${BIG_DATA_DIR}/.singularity/cache"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${BIG_DATA_DIR}/.cache/miniwdl"' >>~/.bashrc After that, '''log out and log back in again''', to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day]. '''If you get errors about mutexes, lock files, or other weird problems with Singularity''', try moving these directories to inside <code>/data/tmp</code> on the individual nodes, or unsetting them and letting Toil use its defaults (and exhaust our Docker pull limits).
[https://github.com/DataBiosphere/toil/issues/4654 It is not clear that <code>/private/groups</code> actually implements the necessary file locking correctly.] =Running an existing workflow= First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, go to your user directory under <code>/private/groups</code>, and make a directory to work in. cd /private/groups/YOURGROUPNAME/YOURUSERNAME mkdir workflow-test cd workflow-test Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ==Preparing an input file== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code> echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) 
So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ==Testing at small scale on a single machine== We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>.
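Incidentally, the <code>\u00f3</code> sequences in that printed JSON are ordinary JSON string escapes, not anything Toil-specific, so any JSON parser will turn them back into the real characters. A quick Python sketch, using one of the filenames from the output above:

```python
import json

# One entry from toil-wdl-runner's standard output, exactly as printed
raw = '{"hello_caller.message_files": ["local_run/Mridula Resurrecci\\u00f3n.txt"]}'

outputs = json.loads(raw)
# json.loads decodes \u00f3 into the accented character
print(outputs["hello_caller.message_files"][0])
```

The decoded name matches the actual file on disk, <code>local_run/Mridula Resurrección.txt</code>.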
==Running at larger scale== Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people! Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the <code>--caching false</code> option. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the <code>TOIL_SLURM_ARGS</code> environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in. mkdir -p logs export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium" toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory.
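If you find yourself writing bigger inputs files, generating them with a JSON library avoids the quoting mistakes that hand-written <code>echo</code> strings can introduce. A minimal Python sketch (the filenames match the ones used above):

```python
import json

# Workflow input keys are "<workflow name>.<input name>", as in the echo command above
inputs = {"hello_caller.who": "./100_names.txt"}

# json.dump takes care of quoting and escaping for us
with open("inputs_big.json", "w") as f:
    json.dump(inputs, f, indent=2)
```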
=Writing your own workflow= In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ==Writing the file== ===Version=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 ===Workflow Block=== Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } ===Input Block=== Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run. ===Body=== Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. 
Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. ===Scattering=== Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } ===Conditionals=== Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. 
Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } ===Calling Tasks=== Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, but only if we don't make a noise instead. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } ===Writing Tasks=== Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill that in in <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> returns a WDL <code>File</code> containing the standard output printed by the task's command.
We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } ===Output Block=== Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1 and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String?
fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. ==Running the Workflow== Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium" toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json =Debugging Workflows= Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.
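Before or after running, it can help to sanity-check what the workflow ''should'' produce. Here is the same logic in plain Python (an illustrative sketch, not part of the workflow; the WDL <code>select_first()</code> is modeled as a first-non-<code>None</code> scan):

```python
def fizzbuzz(item_count, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    """Mirror the FizzBuzz WDL workflow: one result string per 1-based number."""
    results = []
    for i in range(item_count):  # like scatter (i in range(item_count))
        one_based = i + 1
        fizz = "Fizz" if one_based % to_fizz == 0 else None
        buzz = "Buzz" if one_based % to_buzz == 0 else None
        fizzbuzz_val = None
        if fizz and buzz:
            # select_first([fizzbuzz_override, "FizzBuzz"]) in the WDL
            fizzbuzz_val = fizzbuzz_override if fizzbuzz_override is not None else "FizzBuzz"
        # select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
        candidates = [fizzbuzz_val, fizz, buzz, str(one_based)]
        results.append(next(c for c in candidates if c is not None))
    return results

print(fizzbuzz(15))  # 15 is a multiple of both 3 and 5, so the last entry is "FizzBuzz"
```

Running the workflow with <code>{"FizzBuzz.item_count": 20}</code> should give the same sequence this function does.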
==Restarting the Workflow== If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with <code>--jobStore</code> set manually to a directory that persists between attempts, you can add <code>--restart</code> to your workflow command and make Toil try again. It will pick up from where it left off and rerun any failed tasks and then the rest of the workflow. This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart. If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques. ==Debugging Options== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place where you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this. ==Reading the Log== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen when either the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error.
Go up higher in the log until you find lines that look like: [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows: And [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows: These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. If you would like individual task logs to be saved separately for later reference, you can use the <code>--writeLogs</code> option to specify a directory to store them. For more information, see [https://toil.readthedocs.io/en/latest/wdl/running.html#managing-workflow-logs the Toil documentation of workflow task logs]. ==Reproducing Problems== When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. === Automatically Fetching Input Files === The <code>toil debug-job</code> command has a <code>--retrieveTaskDirectory</code> option that lets you dump out a directory with all the files that a failing WDL task would use. You can use it like: toil debug-job ./jobstore WDLTaskJob --retrieveTaskDirectory dumpdir If there are multiple failing tasks, you might need to replace <code>WDLTaskJob</code> with the name of one of the failing jobs. See [https://toil.readthedocs.io/en/latest/running/debugging.html#fetching-job-inputs the Toil documentation on retrieving files] for more on how to use this command. === Manually Finding Input Files === If you can't use <code>toil debug-job</code>, you might need to manually dig through the job store for files. 
'''Tutorial: Getting Started with WDL Workflows on Phoenix'''

Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.

This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.

=Phoenix Cluster Setup=

Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.
==Getting VPN access==

We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it well in advance of when you need to use the cluster.

==Connecting to Phoenix==

Once you have VPN access, you can connect to any of the machines with access to the Phoenix cluster. These interactive nodes are fairly large machines that can do some work locally, but you will still want to run larger workflows on the actual cluster. For this tutorial, we will use <code>emerald.prism</code> as our login node.

To connect to the cluster:

1. Connect to the VPN.

2. SSH to <code>emerald.prism</code>. At the command line, run:

 ssh emerald.prism

If your username on the cluster (say, <code>flastname</code>) is different from your username on your computer (which might be <code>firstname</code>), you might instead have to run:

 ssh flastname@emerald.prism

The first time you connect, you will see a message like:

 The authenticity of host 'emerald.prism (10.50.1.67)' can't be established.
 ED25519 key fingerprint is SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI.
 This key is not known by any other names.
 Are you sure you want to continue connecting (yes/no/[fingerprint])?

This is your computer asking you to help it decide whether it is talking to the genuine <code>emerald.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI</code>.
If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is.

==Installing Toil with WDL support==

Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run:

 pip install --upgrade --user 'toil[wdl]'

If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to install Toil with the <code>aws</code> and <code>google</code> extras, respectively:

 pip install --upgrade --user 'toil[wdl,aws,google]'

This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter '''will not''' look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:

 echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc

After that, '''log out and log back in''' to restart bash and pick up the change. To make sure it worked, you can run:

 toil-wdl-runner --help

If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above.

==Configuring your Phoenix Environment==

'''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs.
Any data worth running a workflow on should be in a directory under <code>/private/groups</code>. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under <code>/private/groups</code>. Usually you would end up with <code>/private/groups/YOURGROUPNAME/YOURUSERNAME</code>. Remember this path; we will need it later.

==Configuring Toil for Phoenix==

Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these files can be large, and the home directory quota is only 30 GB, we can't keep them in your home directory. We will need to use the <code>/private/groups/YOURGROUPNAME/YOURUSERNAME</code> directory you created earlier. Make sure that that directory is recorded in your <code>~/.bashrc</code> file by editing and running this command:

 echo 'BIG_DATA_DIR=/private/groups/YOURGROUPNAME/YOURUSERNAME' >>~/.bashrc

Then use these commands to make sure that Toil knows where it ought to put its caches:

 echo 'export SINGULARITY_CACHEDIR="${BIG_DATA_DIR}/.singularity/cache"' >>~/.bashrc
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${BIG_DATA_DIR}/.cache/miniwdl"' >>~/.bashrc

After that, '''log out and log back in again''' to apply the changes. If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day].
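If you want to double-check what those <code>export</code> lines will make bash compute, the same <code>${VAR}</code> substitution can be reproduced with Python's standard library. A small sketch (the directory name is the tutorial's placeholder, not a real path):

```python
import os.path
import os

# Stand-in for the value written to ~/.bashrc; substitute your real
# group and username.
os.environ["BIG_DATA_DIR"] = "/private/groups/YOURGROUPNAME/YOURUSERNAME"

# The templates the two `export` lines above ask bash to expand.
cache_dirs = {
    "SINGULARITY_CACHEDIR": "${BIG_DATA_DIR}/.singularity/cache",
    "MINIWDL__SINGULARITY__IMAGE_CACHE": "${BIG_DATA_DIR}/.cache/miniwdl",
}
for name, template in cache_dirs.items():
    # os.path.expandvars performs the same ${VAR} substitution bash would.
    print(name, "=", os.path.expandvars(template))
```

Both resulting paths live under the big-data directory, not under the quota-limited home directory, which is the point of the configuration.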
'''If you get errors about mutexes, lock files, or other weird problems with Singularity''', try moving these directories to inside <code>/data/tmp</code> on the individual nodes, or unsetting them and letting Toil use its defaults (and exhaust our Docker pull limits). [https://github.com/DataBiosphere/toil/issues/4654 It is not clear that <code>/private/groups</code> actually implements the necessary file locking correctly.]

=Running an existing workflow=

First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].

Go to your user directory under <code>/private/groups</code>, and make a directory to work in:

 cd /private/groups/YOURGROUPNAME/YOURUSERNAME
 mkdir workflow-test
 cd workflow-test

Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.

 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl

==Preparing an input file==

Near the top of the WDL file, there's a section like this:

 workflow hello_caller {
     input {
         File who
     }

This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code>:

 echo "Mridula Resurrección" >names.txt
 echo "Gershom Šarlota" >>names.txt
 echo "Ritchie Ravi" >>names.txt

Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow.
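If you prefer to avoid hand-quoting JSON, an equivalent way to produce such an inputs file is with a JSON library, which handles escaping for you. A minimal Python sketch (the key comes from the workflow; the filenames are this tutorial's):

```python
import json

# The workflow input "hello_caller.who" takes a path to the names file,
# relative to the location of the inputs file itself.
inputs = {"hello_caller.who": "./names.txt"}

# Writing with json.dump guarantees valid JSON, however odd the values get.
with open("inputs.json", "w") as f:
    json.dump(inputs, f)
```

This becomes more useful as workflows grow to dozens of inputs, where a stray quote in a hand-written file is easy to miss.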
(You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this:

 echo '{"hello_caller.who": "./names.txt"}' >inputs.json

Note that, for the key, we're using the workflow name, a dot, and then the input name. For the value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].

==Testing at small scale on a single machine==

We are now ready to run the workflow! You don't want to run workflows on the head node, so use Slurm to get an interactive session on one of the cluster's worker nodes by running:

 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i

This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command:

 toil-wdl-runner self_test.wdl inputs.json -o local_run

This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. The command will print a lot of logging to standard error, and to standard output it will print:

 {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]}

The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person.
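Those <code>\u00f3</code> sequences are standard JSON string escapes, so any JSON parser will turn them back into the real characters. For example, with Python's standard library:

```python
import json

# A fragment of the output JSON above, exactly as Toil printed it (escaped).
raw = '["Hello, Mridula Resurrecci\\u00f3n!", "Hello, Gershom \\u0160arlota!"]'

# json.loads decodes the \uXXXX escapes back into real Unicode characters.
messages = json.loads(raw)
print(messages[0])  # Hello, Mridula Resurrección!
```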
To leave your interactive Slurm session and return to the head node, use <code>exit</code>.

==Running at larger scale==

Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people! Go get this handy list of people and cut it to length:

 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt
 head -n100 1000_names.txt >100_names.txt

And make a new inputs file:

 echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json

Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the <code>--caching false</code> option. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the <code>TOIL_SLURM_ARGS</code> environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.

 mkdir -p logs
 export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium"
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json

This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory.
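Because <code>-m</code> writes plain JSON, you can post-process a run's outputs with a script instead of eyeballing them. A sketch, assuming a manifest shaped like the self-test workflow's outputs (the literal below is illustrative; for a real run you would use <code>json.load()</code> on <code>slurm_run.json</code>):

```python
import json

# Illustrative stand-in for the contents of slurm_run.json.
manifest = json.loads(
    '{"hello_caller.messages": ["Hello, A!", "Hello, B!"],'
    ' "hello_caller.message_files": ["slurm_run/A.txt", "slurm_run/B.txt"]}'
)

# List every output file the workflow produced.
for path in manifest["hello_caller.message_files"]:
    print(path)
```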
=Writing your own workflow=

In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz].

==Writing the file==

===Version===

All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement:

 version 1.0

===Workflow Block===

Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>:

 version 1.0
 workflow FizzBuzz {
 }

===Input Block===

Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section:

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
 }

Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.

===Body===

Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable.
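For intuition, WDL's <code>range()</code> mirrors Python's built-in <code>range()</code>: it produces the integers from 0 up to, but not including, its argument.

```python
# WDL: Array[Int] numbers = range(item_count)
# Python equivalent of what range(item_count) produces, for item_count = 5:
item_count = 5
numbers = list(range(item_count))
print(numbers)  # [0, 1, 2, 3, 4]
```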
 Array[Int] numbers = range(item_count)

WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.

===Scattering===

Once we have an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0.

 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
 }

===Conditionals===

Inside the body of the scatter, we are going to put some conditionals to determine whether we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we build an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of the fact that variables from un-executed conditionals are <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else, you will have to check the negated condition.

So first, let's handle the special cases.
 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
     if (one_based % to_fizz == 0) {
         String fizz = "Fizz"
         if (one_based % to_buzz == 0) {
             String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
         }
     }
     if (one_based % to_buzz == 0) {
         String buzz = "Buzz"
     }
     if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
         # Just a normal number.
     }
 }

===Calling Tasks===

Now, for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, but only in the case where the call actually ran (i.e. when we didn't produce one of the Fizz/Buzz noises instead).

 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
     if (one_based % to_fizz == 0) {
         String fizz = "Fizz"
         if (one_based % to_buzz == 0) {
             String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
         }
     }
     if (one_based % to_buzz == 0) {
         String buzz = "Buzz"
     }
     if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
         # Just a normal number.
         call stringify_number {
             input:
                 the_number = one_based
         }
     }
     String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
 }

We can put the code into the workflow now, and set about writing the task.

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }

===Writing Tasks===

Our task should go after the workflow in the file. It looks a lot like a workflow, except it uses <code>task</code>.

 task stringify_number {
 }

We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill that in with <code>input</code> and <code>output</code> sections.

 task stringify_number {
     input {
         Int the_number
     }
     # ???
     output {
         String the_string # = ???
     }
 }

Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code> (Bash-like substitution, but with a tilde) to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string # = ???
     }
 }

Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command.
We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
 }

We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file:

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }
 
 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

===Output Block===

Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need one, but you do need it in 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task.

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
     output {
         Array[String] fizzbuzz_results = result
     }
 }
 
 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as an array.

==Running the Workflow==

Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:

 echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json

Then run it on the cluster with Toil:

 mkdir -p logs
 export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium"
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

Or locally:

 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

=Debugging Workflows=

Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.
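Before debugging a real failure, it helps to know what output to expect. The FizzBuzz logic above can be sketched in plain Python, modeling <code>select_first()</code> as picking the first non-<code>None</code> value, so you can compare expected against actual <code>fizzbuzz_results</code>:

```python
def fizzbuzz(item_count, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    # Mirrors the WDL workflow: one result string per one-based number.
    results = []
    for i in range(item_count):
        one_based = i + 1
        fizz = "Fizz" if one_based % to_fizz == 0 else None
        buzz = "Buzz" if one_based % to_buzz == 0 else None
        fizzbuzz_val = None
        if fizz and buzz:
            # WDL: select_first([fizzbuzz_override, "FizzBuzz"])
            fizzbuzz_val = fizzbuzz_override if fizzbuzz_override is not None else "FizzBuzz"
        stringified = str(one_based)  # stands in for the stringify_number task
        # WDL: select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
        results.append(next(v for v in (fizzbuzz_val, fizz, buzz, stringified)
                            if v is not None))
    return results

print(fizzbuzz(15))
# ['1', '2', 'Fizz', '4', 'Buzz', 'Fizz', '7', '8', 'Fizz', 'Buzz', '11', 'Fizz', '13', '14', 'FizzBuzz']
```

Note this is only a reference for checking outputs; unlike the WDL scatter, it runs sequentially.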
==Restarting the Workflow==

If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have since fixed, and you ran the workflow with <code>--jobStore</code> set manually to a directory that persists between attempts, you can add <code>--restart</code> to your workflow command to make Toil try again. It will pick up from where it left off, rerun any failed tasks, and then run the rest of the workflow. This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart. If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques.

==Debugging Options==

When debugging a workflow, make sure to run it with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code>, so that the stored files shipped between jobs are in a place where you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:

 =========> Toil job log is here <=========

Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this.

==Reading the Log==

When a WDL workflow fails, you are likely to see a message like this:

 WDL.runtime.error.CommandFailed: task command failed with exit status 1
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism

This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which happens either when the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error.
Go up higher in the log until you find lines that look like:

 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows:

and

 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows:

These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. If you would like individual task logs to be saved separately for later reference, you can use the <code>--writeLogs</code> option to specify a directory to store them. For more information, see [https://toil.readthedocs.io/en/latest/wdl/running.html#managing-workflow-logs the Toil documentation of workflow task logs].

==Reproducing Problems==

When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this.

=== Automatically Fetching Input Files ===

The <code>toil debug-job</code> command has a <code>--retrieveTaskDirectory</code> option that lets you dump out a directory with all the files that a failing WDL task would use. You can use it like:

 toil debug-job ./jobstore WDLTaskJob --retrieveTaskDirectory dumpdir

If there are multiple failing tasks, you might need to replace <code>WDLTaskJob</code> with the name of one of the failing jobs. See [https://toil.readthedocs.io/en/latest/running/debugging.html#fetching-job-inputs the Toil documentation on retrieving files] for more on how to use this command.

=== Manually Finding Input Files ===

If you can't use <code>toil debug-job</code>, you might need to manually dig through the job store for files.
In the log of your failing Toil task, look for lines like this:

 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'
 ...

The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk. So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at:

 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam

==More Ways of Finding Files==

Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try to find the files by name.
For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ==Using Development Versions of Toil== Sometimes, bugs will be fixed in the development version of Toil, but not released yet. 
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ==Frequently Asked Questions== ===I am getting warnings about <code>XDG_RUNTIME_DIR</code>=== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> You should upgrade Toil. [https://github.com/DataBiosphere/toil/commit/ff6bf60ab798a675c20156c749817c4313644b96 Since Toil 6.1.0], Toil no longer issues this warning, and just puts up with bad <code>XDG_RUNTIME_DIR</code> settings. ===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!=== The Toil worker process for each job will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. 
'''Tutorial: Getting Started with WDL Workflows on Phoenix'''

Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.

This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.

=Phoenix Cluster Setup=

Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.
==Getting VPN access==

We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.

To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.

==Connecting to Phoenix==

Once you have VPN access, you can connect to any of the machines with access to the Phoenix cluster. These interactive nodes are fairly large machines that can do some work locally, but you will still want to run larger workflows on the actual cluster. For this tutorial, we will use <code>emerald.prism</code> as our login node.

To connect to the cluster:

1. Connect to the VPN.

2. SSH to <code>emerald.prism</code>. At the command line, run:

 ssh emerald.prism

If your username on the cluster (say, <code>flastname</code>) is different from your username on your computer (which might be <code>firstname</code>), you might instead have to run:

 ssh flastname@emerald.prism

The first time you connect, you will see a message like:

 The authenticity of host 'emerald.prism (10.50.1.67)' can't be established.
 ED25519 key fingerprint is SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI.
 This key is not known by any other names.
 Are you sure you want to continue connecting (yes/no/[fingerprint])?

This is your computer asking you to help it decide if it is talking to the genuine <code>emerald.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI</code>.
If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is.

==Installing Toil with WDL support==

Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run:

 pip install --upgrade --user 'toil[wdl]'

If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to install Toil with the <code>aws</code> and <code>google</code> extras, respectively:

 pip install --upgrade --user 'toil[wdl,aws,google]'

This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. By default, the command interpreter ''will not'' look there, so if you type <code>toil-wdl-runner</code>, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:

 echo 'export PATH="${HOME}/.local/bin:${PATH}"' >>~/.bashrc

After that, '''log out and log back in''' to restart bash and pick up the change. To make sure it worked, you can run:

 toil-wdl-runner --help

If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pip</code> command above.

==Configuring your Phoenix Environment==

'''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs.
Any data worth running a workflow on should be in a directory under <code>/private/groups</code>. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under <code>/private/groups</code>. Usually you would end up with <code>/private/groups/YOURGROUPNAME/YOURUSERNAME</code>. Remember this path; we will need it later.

==Configuring Toil for Phoenix==

Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps.

Since these image files can be large, and the home directory quota is only 30 GB, we might not be able to keep them in your home directory. We would like to be able to store them on the cluster's large storage array, under <code>/private/groups</code>. However, Toil needs to use file locks in these directories to prevent simultaneous Singularity calls from producing internal Singularity errors, and Ceph currently has [https://tracker.ceph.com/issues/65607 a bug where these file locking operations can freeze the Ceph servers].

If you have '''a small number of container images''' that will fit in your home directory, you can keep them there. [https://github.com/DataBiosphere/toil/commit/cb0b291bb7f6212bfe69221dd9f09d72f83e92fb Since Toil 6.1.0], this is the default behavior and you don't need to do anything. (Unless you previously set <code>SINGULARITY_CACHEDIR</code> or <code>MINIWDL__SINGULARITY__IMAGE_CACHE</code>, in which case you need to unset them.)

'''If you don't have room in your home directory''' for container images, currently the recommended approach is to use node-local storage under <code>/data/tmp</code>. This results in each node pulling each container image, but images will be saved across workflows.
You can set that up for all your workflows with:

 echo 'export SINGULARITY_CACHEDIR="/data/tmp/$(whoami)/cache/singularity"' >>~/.bashrc
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="/data/tmp/$(whoami)/cache/miniwdl"' >>~/.bashrc

Then '''log out and log back in again''', to apply the changes.

=Running an existing workflow=

First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].

Go to your user directory under <code>/private/groups</code>, and make a directory to work in:

 cd /private/groups/YOURGROUPNAME/YOURUSERNAME
 mkdir workflow-test
 cd workflow-test

Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.

 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl

==Preparing an input file==

Near the top of the WDL file, there's a section like this:

 workflow hello_caller {
     input {
         File who
     }

This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.

So first, we have to make that list of names. Let's make it in <code>names.txt</code>:

 echo "Mridula Resurrección" >names.txt
 echo "Gershom Šarlota" >>names.txt
 echo "Ritchie Ravi" >>names.txt

Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.)
So, make another file next to <code>names.txt</code> that references it by relative path, like this:

 echo '{"hello_caller.who": "./names.txt"}' >inputs.json

Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].

==Testing at small scale on a single machine==

We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:

 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i

This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command:

 toil-wdl-runner self_test.wdl inputs.json -o local_run

This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print:

 {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]}

The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>.
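Those <code>\u00f3</code>-style sequences are just JSON string escapes; any JSON parser turns them back into the real characters. A quick Python check, using the exact output line shown above:

```python
import json

# The JSON that toil-wdl-runner printed to standard output, as shown above.
raw = ('{"hello_caller.message_files": ["local_run/Mridula Resurrecci\\u00f3n.txt", '
       '"local_run/Gershom \\u0160arlota.txt", "local_run/Ritchie Ravi.txt"], '
       '"hello_caller.messages": ["Hello, Mridula Resurrecci\\u00f3n!", '
       '"Hello, Gershom \\u0160arlota!", "Hello, Ritchie Ravi!"]}')

outputs = json.loads(raw)
# json.loads() decodes \u00f3 back into the actual character.
print(outputs["hello_caller.messages"][0])  # Hello, Mridula Resurrección!
```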
==Running at larger scale==

Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people! Go get this handy list of people and cut it to length:

 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt
 head -n100 1000_names.txt >100_names.txt

And make a new inputs file:

 echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json

Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create, where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the <code>--caching false</code> option. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it.

Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the <code>TOIL_SLURM_ARGS</code> environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.

 mkdir -p logs
 export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium"
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json

This will tick along for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory.
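Since the <code>-m</code> option saved the output JSON to <code>slurm_run.json</code>, you can also check the results programmatically. Here is a minimal Python sketch; the <code>check_outputs</code> helper is illustrative, not part of Toil:

```python
import json

def check_outputs(path, expected_count=100):
    """Load the JSON written by toil-wdl-runner's -m option and check
    that the workflow produced one greeting per input name."""
    with open(path) as f:
        outputs = json.load(f)
    messages = outputs["hello_caller.messages"]
    if len(messages) != expected_count:
        raise RuntimeError(
            "expected %d greetings, got %d" % (expected_count, len(messages)))
    return messages
```

After a successful run, <code>check_outputs("slurm_run.json")</code> should return 100 greeting strings.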
=Writing your own workflow=

In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz].

==Writing the file==

===Version===

All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement:

 version 1.0

===Workflow Block===

Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>.

 version 1.0
 
 workflow FizzBuzz {
 }

===Input Block===

Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section.

 version 1.0
 
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
 }

Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.

===Body===

Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable.
 Array[Int] numbers = range(item_count)

WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.

===Scattering===

Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0.

 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
 }

===Conditionals===

Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array.

Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>.

Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else, you will have to check the negated condition.

So first, let's handle the special cases.
 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
     if (one_based % to_fizz == 0) {
         String fizz = "Fizz"
         if (one_based % to_buzz == 0) {
             String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
         }
     }
     if (one_based % to_buzz == 0) {
         String buzz = "Buzz"
     }
     if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
         # Just a normal number.
     }
 }

===Calling Tasks===

Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later.

To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, but only if we didn't produce a noise instead.

 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
     if (one_based % to_fizz == 0) {
         String fizz = "Fizz"
         if (one_based % to_buzz == 0) {
             String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
         }
     }
     if (one_based % to_buzz == 0) {
         String buzz = "Buzz"
     }
     if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
         # Just a normal number.
         call stringify_number {
             input:
                 the_number = one_based
         }
     }
     String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
 }

We can put the code into the workflow now, and set about writing the task.

 version 1.0
 
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }

===Writing Tasks===

Our task should go after the workflow in the file. It looks a lot like a workflow, except it uses <code>task</code>.

 task stringify_number {
 }

We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill that in with <code>input</code> and <code>output</code> sections.

 task stringify_number {
     input {
         Int the_number
     }
     # ???
     output {
         String the_string # = ???
     }
 }

Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string # = ???
     }
 }

Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command.
We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
 }

We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage.

Then we can put our task into our WDL file:

 version 1.0
 
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }
 
 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

===Output Block===

Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task.

 version 1.0
 
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
     output {
         Array[String] fizzbuzz_results = result
     }
 }
 
 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as an array.

==Running the Workflow==

Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:

 echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json

Then run it on the cluster with Toil:

 mkdir -p logs
 export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium"
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

Or locally:

 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

=Debugging Workflows=

Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.
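Before debugging a real failure, it can help to know what output you ''expect''. The FizzBuzz workflow above is easy to mirror in plain Python: the scatter behaves like a list comprehension, and <code>select_first()</code> picks the first non-null candidate. This sketch only illustrates the workflow's logic; it is not something Toil runs:

```python
def expected_fizzbuzz(item_count, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    """Mirror of the FizzBuzz WDL workflow: one result string per number."""
    results = []
    for i in range(item_count):  # like: scatter (i in numbers)
        one_based = i + 1
        fizz = "Fizz" if one_based % to_fizz == 0 else None
        buzz = "Buzz" if one_based % to_buzz == 0 else None
        fizzbuzz = None
        if fizz is not None and buzz is not None:
            # select_first([fizzbuzz_override, "FizzBuzz"])
            fizzbuzz = fizzbuzz_override if fizzbuzz_override is not None else "FizzBuzz"
        # select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]):
        # un-executed branches left their variables as None, so take the first
        # candidate that is set.
        candidates = (fizzbuzz, fizz, buzz, str(one_based))
        results.append(next(c for c in candidates if c is not None))
    return results
```

With <code>item_count = 20</code> as in <code>fizzbuzz.json</code>, <code>expected_fizzbuzz(20)</code> should match the <code>fizzbuzz_results</code> array in <code>fizzbuzz_out.json</code>.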
==Restarting the Workflow==

If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have since fixed, and you ran the workflow with <code>--jobStore</code> set manually to a directory that persists between attempts, you can add <code>--restart</code> to your workflow command and make Toil try again. It will pick up from where it left off, rerun any failed tasks, and then run the rest of the workflow. This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart.

If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques.

==Debugging Options==

When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code>, so that the stored files shipped between jobs are in a place where you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:

 =========> Toil job log is here <=========

Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this.

==Reading the Log==

When a WDL workflow fails, you are likely to see a message like this:

 WDL.runtime.error.CommandFailed: task command failed with exit status 1
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism

This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen either when the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error.
Go up higher in the log until you find lines that look like:

 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows:

and

 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows:

These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.

If you would like individual task logs to be saved separately for later reference, you can use the <code>--writeLogs</code> option to specify a directory to store them. For more information, see [https://toil.readthedocs.io/en/latest/wdl/running.html#managing-workflow-logs the Toil documentation of workflow task logs].

==Reproducing Problems==

When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this.

=== Automatically Fetching Input Files ===

The <code>toil debug-job</code> command has a <code>--retrieveTaskDirectory</code> option that lets you dump out a directory with all the files that a failing WDL task would use. You can use it like:

 toil debug-job ./jobstore WDLTaskJob --retrieveTaskDirectory dumpdir

If there are multiple failing tasks, you might need to replace <code>WDLTaskJob</code> with the name of one of the failing jobs. See [https://toil.readthedocs.io/en/latest/running/debugging.html#fetching-job-inputs the Toil documentation on retrieving files] for more on how to use this command.

=== Manually Finding Input Files ===

If you can't use <code>toil debug-job</code>, you might need to manually dig through the job store for files.
In the log of your failing Toil task, look for lines like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk. So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ==More Ways of Finding Files== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name. 
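Putting the log excerpt above together with your <code>--jobStore</code> path is a plain string join; a runnable sketch using the example values from this page:

```shell
# Example values copied from the log excerpt above; substitute your own.
JOBSTORE=/private/groups/patenlab/anovak/jobstore
FILE_ID='files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam'
# The file's on-disk location is the job store path plus the relative file ID.
echo "$JOBSTORE/$FILE_ID"
```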
For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ==Using Development Versions of Toil== Sometimes, bugs will be fixed in the development version of Toil, but not released yet. 
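The decode-and-extract steps described above can also be scripted; this sketch assumes <code>python3</code> is available and reuses the example URI from this page:

```shell
# The virtualized-file URI copied from the example log line above.
URI='toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam'
# URL-decode the URI with Python's standard library.
DECODED=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.argv[1]))' "$URI")
# The part after the last colon is the path relative to the job store.
echo "${DECODED##*:}"
```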
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ==Frequently Asked Questions== ===I am getting warnings about <code>XDG_RUNTIME_DIR</code>=== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> You should upgrade Toil. [https://github.com/DataBiosphere/toil/commit/ff6bf60ab798a675c20156c749817c4313644b96 Since Toil 6.1.0], Toil no longer issues this warning, and just puts up with bad <code>XDG_RUNTIME_DIR</code> settings. ===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!=== The Toil worker process for each job will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. 
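A hunt for a crashed worker's log might therefore look like this sketch (the node name is the one from the example error earlier on this page, and the search path is a placeholder; the commands are echoed so the sketch is safe to paste anywhere — drop the <code>echo</code> to actually run them):

```shell
# Hypothetical node name; use the host named in the failure message you actually saw.
NODE=phoenix-15.prism
# Log into the node the worker ran on, then search its temp space for worker logs.
echo ssh "$NODE"
echo "find /data/tmp -name worker_log.txt"
```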
=Additional WDL resources= For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL documentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification] c577a968ce491e06c778d3aa09a2e1b9d328e69a Genomics Institute Computing Information 0 6 510 478 2024-07-26T15:44:48Z Weiler 3 wikitext text/x-wiki Welcome to the Genomic Institute Computing Information Repository! Browse the below topics for help in the area you are curious about. == GI Public Computing Environment == *[[How to access the public servers]] == GI Firewalled Computing Environment (PRISM) == *[[Access to the Firewalled Compute Servers]] *[[Firewalled Computing Resources Overview]] *[[Firewalled Environment Storage Overview]] *[[Firewalled User Account and Storage Cost]] *[[Grafana Performance Metrics]] *[[Visual Studio Code (vscode) Configuration Tweaks]] [[http://logserv.gi.ucsc.edu/cgi-bin/private-groups.cgi|'''/private/groups''' Data Usage Graphs]] ==VPN Access== *[[Requirement for users to get GI VPN access]] == NIH dbGaP Access Requirements == *[[Requirements for dbGaP Access]] == giCloud Openstack == *[[Overview of giCloud in the Genomics Institute]] *[[Quick Start Instructions to Get Rolling with OpenStack]] == Amazon Web Services Information == *[[Overview of Getting and Using an AWS IAM Account]] *[[AWS Account List and Numbers]] *[[AWS Shared Bucket Usage Graphs]] *[[AWS Best Practices]] *[[AWS S3 Lifecycle Management]] == Slurm at the Genomics Institute == *[[Overview of using Slurm]] *[[Cluster Etiquette]] *[[Annotated Slurm Script]] *[[Job Arrays]] *[[GPU Resources]] *[[Quick Reference Guide]] *[[Slurm Queues (Partitions) and Resource Management]] *[[Slurm Tips for vg]] *[[Slurm Tips for Toil]] *[[Using Docker under Slurm]] *[[Phoenix WDL Tutorial]] ==General Docker
Information== *[[Running a Container as a non-root User]] == Problems or technical support == If you have any problems with the GI computing environment, please send an email to '''cluster-admin@soe.ucsc.edu''' 73f4e9e4e801415bc20509936aa76a45b9113213 511 510 2024-07-26T15:45:05Z Weiler 3 /* GI Firewalled Computing Environment (PRISM) */ wikitext text/x-wiki Welcome to the Genomic Institute Computing Information Repository! Browse the below topics for help in the area you are curious about. == GI Public Computing Environment == *[[How to access the public servers]] == GI Firewalled Computing Environment (PRISM) == *[[Access to the Firewalled Compute Servers]] *[[Firewalled Computing Resources Overview]] *[[Firewalled Environment Storage Overview]] *[[Firewalled User Account and Storage Cost]] *[[Grafana Performance Metrics]] *[[Visual Studio Code (vscode) Configuration Tweaks]] *[[http://logserv.gi.ucsc.edu/cgi-bin/private-groups.cgi|'''/private/groups''' Data Usage Graphs]] ==VPN Access== *[[Requirement for users to get GI VPN access]] == NIH dbGaP Access Requirements == *[[Requirements for dbGaP Access]] == giCloud Openstack == *[[Overview of giCloud in the Genomics Institute]] *[[Quick Start Instructions to Get Rolling with OpenStack]] == Amazon Web Services Information == *[[Overview of Getting and Using an AWS IAM Account]] *[[AWS Account List and Numbers]] *[[AWS Shared Bucket Usage Graphs]] *[[AWS Best Practices]] *[[AWS S3 Lifecycle Management]] == Slurm at the Genomics Institute == *[[Overview of using Slurm]] *[[Cluster Etiquette]] *[[Annotated Slurm Script]] *[[Job Arrays]] *[[GPU Resources]] *[[Quick Reference Guide]] *[[Slurm Queues (Partitions) and Resource Management]] *[[Slurm Tips for vg]] *[[Slurm Tips for Toil]] *[[Using Docker under Slurm]] *[[Phoenix WDL Tutorial]] ==General Docker Information== *[[Running a Container as a non-root User]] == Problems or technical support == If you have any problems with the GI computing environment, please send 
an email to '''cluster-admin@soe.ucsc.edu''' c6bef10fdccbd35c1e6e89737614f6ec7eedbbe1 512 511 2024-07-26T15:45:16Z Weiler 3 /* GI Firewalled Computing Environment (PRISM) */ wikitext text/x-wiki Welcome to the Genomic Institute Computing Information Repository! Browse the below topics for help in the area you are curious about. == GI Public Computing Environment == *[[How to access the public servers]] == GI Firewalled Computing Environment (PRISM) == *[[Access to the Firewalled Compute Servers]] *[[Firewalled Computing Resources Overview]] *[[Firewalled Environment Storage Overview]] *[[Firewalled User Account and Storage Cost]] *[[Grafana Performance Metrics]] *[[Visual Studio Code (vscode) Configuration Tweaks]] *[http://logserv.gi.ucsc.edu/cgi-bin/private-groups.cgi|'''/private/groups''' Data Usage Graphs] ==VPN Access== *[[Requirement for users to get GI VPN access]] == NIH dbGaP Access Requirements == *[[Requirements for dbGaP Access]] == giCloud Openstack == *[[Overview of giCloud in the Genomics Institute]] *[[Quick Start Instructions to Get Rolling with OpenStack]] == Amazon Web Services Information == *[[Overview of Getting and Using an AWS IAM Account]] *[[AWS Account List and Numbers]] *[[AWS Shared Bucket Usage Graphs]] *[[AWS Best Practices]] *[[AWS S3 Lifecycle Management]] == Slurm at the Genomics Institute == *[[Overview of using Slurm]] *[[Cluster Etiquette]] *[[Annotated Slurm Script]] *[[Job Arrays]] *[[GPU Resources]] *[[Quick Reference Guide]] *[[Slurm Queues (Partitions) and Resource Management]] *[[Slurm Tips for vg]] *[[Slurm Tips for Toil]] *[[Using Docker under Slurm]] *[[Phoenix WDL Tutorial]] ==General Docker Information== *[[Running a Container as a non-root User]] == Problems or technical support == If you have any problems with the GI computing environment, please send an email to '''cluster-admin@soe.ucsc.edu''' d08a965e06ef7424c2d3f35e2f7b77b5065f8ba2 513 512 2024-07-26T15:45:51Z Weiler 3 /* GI Firewalled Computing Environment 
(PRISM) */ wikitext text/x-wiki Welcome to the Genomic Institute Computing Information Repository! Browse the below topics for help in the area you are curious about. == GI Public Computing Environment == *[[How to access the public servers]] == GI Firewalled Computing Environment (PRISM) == *[[Access to the Firewalled Compute Servers]] *[[Firewalled Computing Resources Overview]] *[[Firewalled Environment Storage Overview]] *[[Firewalled User Account and Storage Cost]] *[[Grafana Performance Metrics]] *[[Visual Studio Code (vscode) Configuration Tweaks]] *[http://logserv.gi.ucsc.edu/cgi-bin/private-groups.cgi '''/private/groups''' Data Usage Graphs] ==VPN Access== *[[Requirement for users to get GI VPN access]] == NIH dbGaP Access Requirements == *[[Requirements for dbGaP Access]] == giCloud Openstack == *[[Overview of giCloud in the Genomics Institute]] *[[Quick Start Instructions to Get Rolling with OpenStack]] == Amazon Web Services Information == *[[Overview of Getting and Using an AWS IAM Account]] *[[AWS Account List and Numbers]] *[[AWS Shared Bucket Usage Graphs]] *[[AWS Best Practices]] *[[AWS S3 Lifecycle Management]] == Slurm at the Genomics Institute == *[[Overview of using Slurm]] *[[Cluster Etiquette]] *[[Annotated Slurm Script]] *[[Job Arrays]] *[[GPU Resources]] *[[Quick Reference Guide]] *[[Slurm Queues (Partitions) and Resource Management]] *[[Slurm Tips for vg]] *[[Slurm Tips for Toil]] *[[Using Docker under Slurm]] *[[Phoenix WDL Tutorial]] ==General Docker Information== *[[Running a Container as a non-root User]] == Problems or technical support == If you have any problems with the GI computing environment, please send an email to '''cluster-admin@soe.ucsc.edu''' 8088c00396e384cf1f6571389e89133104feaed9 518 513 2024-09-28T16:35:51Z Weiler 3 /* Slurm at the Genomics Institute */ wikitext text/x-wiki Welcome to the Genomic Institute Computing Information Repository! Browse the below topics for help in the area you are curious about. 
== GI Public Computing Environment == *[[How to access the public servers]] == GI Firewalled Computing Environment (PRISM) == *[[Access to the Firewalled Compute Servers]] *[[Firewalled Computing Resources Overview]] *[[Firewalled Environment Storage Overview]] *[[Firewalled User Account and Storage Cost]] *[[Grafana Performance Metrics]] *[[Visual Studio Code (vscode) Configuration Tweaks]] *[http://logserv.gi.ucsc.edu/cgi-bin/private-groups.cgi '''/private/groups''' Data Usage Graphs] ==VPN Access== *[[Requirement for users to get GI VPN access]] == NIH dbGaP Access Requirements == *[[Requirements for dbGaP Access]] == giCloud Openstack == *[[Overview of giCloud in the Genomics Institute]] *[[Quick Start Instructions to Get Rolling with OpenStack]] == Amazon Web Services Information == *[[Overview of Getting and Using an AWS IAM Account]] *[[AWS Account List and Numbers]] *[[AWS Shared Bucket Usage Graphs]] *[[AWS Best Practices]] *[[AWS S3 Lifecycle Management]] == Slurm at the Genomics Institute == *[[Overview of using Slurm]] *[[Cluster Etiquette]] *[[Annotated Slurm Script]] *[[Job Arrays]] *[[GPU Resources]] *[[Quick Reference Guide]] *[[Convenient Slurm Commands]] *[[Slurm Queues (Partitions) and Resource Management]] *[[Slurm Tips for vg]] *[[Slurm Tips for Toil]] *[[Using Docker under Slurm]] *[[Phoenix WDL Tutorial]] ==General Docker Information== *[[Running a Container as a non-root User]] == Problems or technical support == If you have any problems with the GI computing environment, please send an email to '''cluster-admin@soe.ucsc.edu''' 8489b79202db3e917d07185ee20a98a6c8ddf72e 542 518 2025-01-19T22:27:19Z Weiler 3 /* GI Firewalled Computing Environment (PRISM) */ wikitext text/x-wiki Welcome to the Genomic Institute Computing Information Repository! Browse the below topics for help in the area you are curious about. 
== GI Public Computing Environment == *[[How to access the public servers]] == GI Firewalled Computing Environment (PRISM) == *[[Access to the Firewalled Compute Servers]] *[[Firewalled Computing Resources Overview]] *[[Firewalled Environment Storage Overview]] *[[Firewalled User Account and Storage Cost]] *[[Grafana Performance Metrics]] *[[Visual Studio Code (vscode) Configuration Tweaks]] *[http://logserv.gi.ucsc.edu/cgi-bin/private-groups.cgi '''/private/groups''' Data Usage Graphs] *[[Resetting your VPN/PRISM Password]] ==VPN Access== *[[Requirement for users to get GI VPN access]] == NIH dbGaP Access Requirements == *[[Requirements for dbGaP Access]] == giCloud Openstack == *[[Overview of giCloud in the Genomics Institute]] *[[Quick Start Instructions to Get Rolling with OpenStack]] == Amazon Web Services Information == *[[Overview of Getting and Using an AWS IAM Account]] *[[AWS Account List and Numbers]] *[[AWS Shared Bucket Usage Graphs]] *[[AWS Best Practices]] *[[AWS S3 Lifecycle Management]] == Slurm at the Genomics Institute == *[[Overview of using Slurm]] *[[Cluster Etiquette]] *[[Annotated Slurm Script]] *[[Job Arrays]] *[[GPU Resources]] *[[Quick Reference Guide]] *[[Convenient Slurm Commands]] *[[Slurm Queues (Partitions) and Resource Management]] *[[Slurm Tips for vg]] *[[Slurm Tips for Toil]] *[[Using Docker under Slurm]] *[[Phoenix WDL Tutorial]] ==General Docker Information== *[[Running a Container as a non-root User]] == Problems or technical support == If you have any problems with the GI computing environment, please send an email to '''cluster-admin@soe.ucsc.edu''' 30b668a112aeb11f8370cf18ccc4fe760344ec05 Cluster Etiquette 0 47 514 483 2024-08-02T17:28:23Z Weiler 3 wikitext text/x-wiki When running jobs on the cluster, you must be very aware of how those jobs will affect other users. 1: Always test your job by running one first. Just one. Note how much RAM, how many CPU cores and how much time it takes to run. 
Then, when you submit 50 or 100 of those, you can specify limits in your Slurm batch file on how long the job should run, how much RAM it should use, and how many CPU cores it needs. That way, Slurm can stop jobs that inadvertently go too long or use too many resources. 2: Don't run too many jobs at once if they use a lot of disk I/O. If every job reads in a 100GB file, and you launch 20 of them at the same time, you could bring down the /private/groups filesystem. Run only maybe 5 at once in that case, or introduce a random delay at the start of your jobs. You can limit your concurrent jobs by specifying something like this in your job batch file: #SBATCH --array=[1-279]%10 inputList=$1 input=$(sed -n "$SLURM_ARRAY_TASK_ID"p $inputList) some_command $input 3: Please do not pin cluster resources with interactive jobs and let them sit idle. Sometimes folks will open an interactive cluster job with a week-long runtime and just let it sit in order to "hold" a spot in the queue for when they might eventually want to run something. This is a waste of resources, and it also forces others who have work ready to go to wait in the queue while nodes sit idle. If you use an interactive job via '''srun''' or '''salloc''', please start it immediately upon launch and close it immediately upon the job's completion. 4: Don't use too much storage. Use http://logserv.gi.ucsc.edu/cgi-bin/private-groups.cgi to look at how your storage use is divided among your directories, and clean up large chunks of data that you do not need. fffe83f529aad9fcb81d089f4bb91b88746fb9eb File:Ucsc gi private diagram.png 6 54 515 2024-09-25T16:29:08Z Weiler 3 wikitext text/x-wiki da39a3ee5e6b4b0d3255bfef95601890afd80709 Firewalled Computing Resources Overview 0 41 516 502 2024-09-25T16:31:07Z Weiler 3 wikitext text/x-wiki == Doing Work and Computing == When doing research, running jobs and the like, please be careful of your resource consumption on the server you are on.
Don't run too many threads or cores at once if such a thing overruns the RAM available or the disk IO available. If you are not sure of your potential RAM, CPU or disk impact, start small with one or two processes and work your way up from there. Also, before running your stuff, check what else is already happening on the server by using the 'top' command to see who else and what else is running and what kind of resources are already being consumed. If, after starting a process, you realize that the server slows down considerably or becomes unusable, kill your processes and re-evaluate what you need to make things work. These servers are shared resources - be a good neighbor! == Server Types and Management == After confirming your VPN software is working, you can ssh into one of the shared compute servers behind the VPN. The DNS suffix for all machines is ".prism". So, "mustard" would have a full DNS name of "mustard.prism": {| class="wikitable" style="text-align:center;" |- style="font-weight:bold; text-align:left;" ! Node Name ! Operating System<br /> ! CPU Cores ! Memory ! Network Bandwidth ! Scratch Space |- | style="text-align:left;" | mustard | style="text-align:left;" | Ubuntu 22.04 | 160 | 1.5 TB | 10 Gb/s | 9 TB |- | style="text-align:left;" | emerald | style="text-align:left;" | Ubuntu 22.04 | 64 | 1 TB | 10 Gb/s | 690 GB |- | style="text-align:left;" | crimson | style="text-align:left;" | Ubuntu 22.04 | 32 | 256 GB | 10 Gb/s | 5.5 TB |- | style="text-align:left;" | razzmatazz | style="text-align:left;" | Ubuntu 22.04 | 32 | 256 GB | 10 Gb/s | 5.5 TB |} These ''shared'' servers are managed by the Genomics Institute Cluster Admin group. If you need software installed on any of these servers, please make your request by emailing cluster-admin@soe.ucsc.edu. == The Firewall == All servers are behind a firewall in this environment, and as such, you must connect to the VPN in order to access them. 
They will not be accessible from the greater Internet without VPN. Although you will be able to connect outbound from them to other servers on the internet to copy data in, sync git repos, stuff like that. It is only inbound connections that will be blocked. All machines behind the firewall have the private domain name suffix of "*.prism". == The Phoenix Cluster == This is a cluster of 25 Ubuntu 22.04 nodes, some of which have GPUs in them. Each node generally has about 2TB RAM and 256 cores, although the cluster is heterogeneous and has multiple node types. You interact with the Phoenix Cluster via the Slurm Job Scheduler. You must specifically request access to use Slurm on the Phoenix Cluster, just email '''cluster-admin@soe.ucsc.edu''' for access. {| class="wikitable" style="text-align:center;" |- style="font-weight:bold;" ! Node Name ! Operating System<br /> ! CPU Cores ! GPUs/Type ! Memory ! Network Bandwidth ! Scratch Space |- | style="text-align:left;" | phoenix-00 | style="text-align:left;" | Ubuntu 22.04 | 256 | 8 / Nvidia A100 | 1 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[01-05] | style="text-align:left;" | Ubuntu 22.04 | 256 | 8 / Nvidia A5500 | 2 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[06-08] | style="text-align:left;" | Ubuntu 22.04 | 256 | N/A | 2 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[09-10] | style="text-align:left;" | Ubuntu 22.04 | 384 | N/A | 2.3 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[11-21] | style="text-align:left;" | Ubuntu 22.04 | 256 | N/A | 2 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[22-24] | style="text-align:left;" | Ubuntu 22.04 | 384 | N/A | 2.3 TB | 10 Gb/s | 16 TB NVMe |} The cluster head node is '''phoenix.prism'''. 
However, you cannot directly login to phoenix.prism in order to protect the scheduler from errant or runaway jobs there, so jobs can be submitted from any interactive compute server (mustard, emerald, razzmatazz or crimson). To learn more about how to use Slurm, refer to: https://giwiki.gi.ucsc.edu/index.php/Genomics_Institute_Computing_Information#Slurm_at_the_Genomics_Institute For scratch on the cluster, TMPDIR will be set to /data/tmp (which is local to each cluster node). That area is cleaned often so don't store any data there that isn't being used by your jobs. ==Graphical Diagram of the Firewalled Area== This is a general representation of how things look: [[File:grafana_menu.png|900px]] a46ad6b35ffe7f810d57a72ad20b9ef83bd374b6 517 516 2024-09-25T16:31:50Z Weiler 3 wikitext text/x-wiki == Doing Work and Computing == When doing research, running jobs and the like, please be careful of your resource consumption on the server you are on. Don't run too many threads or cores at once if such a thing overruns the RAM available or the disk IO available. If you are not sure of your potential RAM, CPU or disk impact, start small with one or two processes and work your way up from there. Also, before running your stuff, check what else is already happening on the server by using the 'top' command to see who else and what else is running and what kind of resources are already being consumed. If, after starting a process, you realize that the server slows down considerably or becomes unusable, kill your processes and re-evaluate what you need to make things work. These servers are shared resources - be a good neighbor! == Server Types and Management == After confirming your VPN software is working, you can ssh into one of the shared compute servers behind the VPN. The DNS suffix for all machines is ".prism". So, "mustard" would have a full DNS name of "mustard.prism": {| class="wikitable" style="text-align:center;" |- style="font-weight:bold; text-align:left;" ! 
Node Name ! Operating System<br /> ! CPU Cores ! Memory ! Network Bandwidth ! Scratch Space |- | style="text-align:left;" | mustard | style="text-align:left;" | Ubuntu 22.04 | 160 | 1.5 TB | 10 Gb/s | 9 TB |- | style="text-align:left;" | emerald | style="text-align:left;" | Ubuntu 22.04 | 64 | 1 TB | 10 Gb/s | 690 GB |- | style="text-align:left;" | crimson | style="text-align:left;" | Ubuntu 22.04 | 32 | 256 GB | 10 Gb/s | 5.5 TB |- | style="text-align:left;" | razzmatazz | style="text-align:left;" | Ubuntu 22.04 | 32 | 256 GB | 10 Gb/s | 5.5 TB |} These ''shared'' servers are managed by the Genomics Institute Cluster Admin group. If you need software installed on any of these servers, please make your request by emailing cluster-admin@soe.ucsc.edu. == The Firewall == All servers are behind a firewall in this environment, and as such, you must connect to the VPN in order to access them. They will not be accessible from the greater Internet without VPN. Although you will be able to connect outbound from them to other servers on the internet to copy data in, sync git repos, stuff like that. It is only inbound connections that will be blocked. All machines behind the firewall have the private domain name suffix of "*.prism". == The Phoenix Cluster == This is a cluster of 25 Ubuntu 22.04 nodes, some of which have GPUs in them. Each node generally has about 2TB RAM and 256 cores, although the cluster is heterogeneous and has multiple node types. You interact with the Phoenix Cluster via the Slurm Job Scheduler. You must specifically request access to use Slurm on the Phoenix Cluster, just email '''cluster-admin@soe.ucsc.edu''' for access. {| class="wikitable" style="text-align:center;" |- style="font-weight:bold;" ! Node Name ! Operating System<br /> ! CPU Cores ! GPUs/Type ! Memory ! Network Bandwidth ! 
Scratch Space |- | style="text-align:left;" | phoenix-00 | style="text-align:left;" | Ubuntu 22.04 | 256 | 8 / Nvidia A100 | 1 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[01-05] | style="text-align:left;" | Ubuntu 22.04 | 256 | 8 / Nvidia A5500 | 2 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[06-08] | style="text-align:left;" | Ubuntu 22.04 | 256 | N/A | 2 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[09-10] | style="text-align:left;" | Ubuntu 22.04 | 384 | N/A | 2.3 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[11-21] | style="text-align:left;" | Ubuntu 22.04 | 256 | N/A | 2 TB | 10 Gb/s | 16 TB NVMe |- | style="text-align:left;" | phoenix-[22-24] | style="text-align:left;" | Ubuntu 22.04 | 384 | N/A | 2.3 TB | 10 Gb/s | 16 TB NVMe |} The cluster head node is '''phoenix.prism'''. However, you cannot directly login to phoenix.prism in order to protect the scheduler from errant or runaway jobs there, so jobs can be submitted from any interactive compute server (mustard, emerald, razzmatazz or crimson). To learn more about how to use Slurm, refer to: https://giwiki.gi.ucsc.edu/index.php/Genomics_Institute_Computing_Information#Slurm_at_the_Genomics_Institute For scratch on the cluster, TMPDIR will be set to /data/tmp (which is local to each cluster node). That area is cleaned often so don't store any data there that isn't being used by your jobs. ==Graphical Diagram of the Firewalled Area== This is a general representation of how things look: [[File:Ucsc_gi_private_diagram.png|900px]] c36570b83bd42e277f37231cff0cb9b45f8542c6 Convenient Slurm Commands 0 55 519 2024-09-28T16:40:31Z Weiler 3 Created page with "__TOC__ ==General commands== Get documentation on a command: man <command> Try the following commands: man sbatch man squeue man scancel ==Submitting Jobs== The following example script specifies a partition, time limit, memory allocation and number of cores. 
All your scripts should specify values for these four parameters. You can also set additional parameters as shown, such as jobname and output file. For This script performs a simple task — it generates of..." wikitext text/x-wiki __TOC__ ==General commands== Get documentation on a command: man <command> Try the following commands: man sbatch man squeue man scancel ==Submitting Jobs== The following example script specifies a partition, time limit, memory allocation and number of cores. All your scripts should specify values for these four parameters. You can also set additional parameters as shown, such as jobname and output file. This script performs a simple task — it generates a file of random numbers and then sorts it. A detailed explanation of the script is available here. #!/bin/bash # #SBATCH -p shared # partition (queue) #SBATCH -c 1 # number of cores #SBATCH --mem 100 # memory pool for all cores #SBATCH -t 0-2:00 # time (D-HH:MM) #SBATCH -o slurm.%N.%j.out # STDOUT #SBATCH -e slurm.%N.%j.err # STDERR for i in {1..100000}; do echo $RANDOM >> SomeRandomNumbers.txt done sort SomeRandomNumbers.txt Now you can submit your job with the command: sbatch myscript.sh If you want to test your job and find out when it is estimated to run, use the following (note this does not actually submit the job): sbatch --test-only myscript.sh ==Information On Jobs== List all current jobs for a user: squeue -u <username> List all running jobs for a user: squeue -u <username> -t RUNNING List all pending jobs for a user: squeue -u <username> -t PENDING List priority order of jobs for the current user (you) in a given partition: showq-slurm -o -u -q <partition> List jobs run by the current user since a certain date: sacct --starttime <YYYY-MM-DD> List jobs run by a user during an interval marked by a start, -S, and an end, -E, date along with the information on the job id, the allocated node, partition, number of allocated CPUs, state of the job, and the start time of the job: sacct -S
<YYYY-MM-DD> -E <YYYY-MM-DD> -u <username> --format=JobID,nodelist,Partition,AllocCPUs,State,start If the end date is left out, then the sacct command will list the jobs starting from the start date until now. List detailed information for a currently running job (useful for troubleshooting): scontrol show jobid -dd <jobid> List status info for a currently running job: sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps To view the command line used at the time of submission of a job: sacct -j <jobid> -o submitline -P To see the batch script of a submitted job: sacct -j <jobid> --batch Once your job has completed, you can get additional information that was not available during the run. This includes run time, memory used, etc. To get statistics on both completed jobs and currently running jobs by jobID: sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed,nodelist -X To view the same information for all jobs of a user: sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed ==Controlling jobs== To cancel one job: scancel <jobid> To cancel all the jobs for a user: scancel -u <username> To cancel all the pending jobs for a user: scancel -t PENDING -u <username> To cancel one or more jobs by name: scancel --name myJobName To hold a particular job from being scheduled: scontrol hold <jobid> To release a particular job to be scheduled: scontrol release <jobid> To requeue (cancel and rerun) a particular job: scontrol requeue <jobid> ==Job Arrays and Useful Commands== As shown in the commands above, it's easy to refer to one job by its Job ID, or to all your jobs via your username. What if you want to refer to a subset of your jobs? The answer is to submit your job set as a job array. Then you can use the job array ID to refer to the set when running SLURM commands. SLURM job arrays To cancel an indexed job in a job array: scancel <jobid>_<index> e.g.
 scancel 1234_4

To find the original submit time for your job array:
 sacct -j 32532756 -o submit -X --noheader | uniq

==Advanced (but useful!) Commands==

The following commands work for individual jobs and for job arrays, and allow easy manipulation of large numbers of jobs. You can combine them with the parameters shown above for great flexibility and precision in job control. (Note that each of these commands is entered on one line.)

Suspend all running jobs for a user (takes job arrays into account):
 squeue -ho %A -t R | xargs -n 1 scontrol suspend

Resume all suspended jobs for a user:
 squeue -o "%.18A %.18t" -u <username> | awk '{if ($2 =="S"){print $1}}' | xargs -n 1 scontrol resume

After resuming, check if any are still suspended:
 squeue -ho %A -u $USER -t S | wc -l
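A quick way to sanity-check the resume one-liner above is to run its awk filter on simulated <code>squeue</code> output (the job IDs below are made up):

```shell
# Fake two-column squeue output: "<jobid> <state>". The filter keeps only the
# IDs of suspended (state S) jobs, which would then be piped to scontrol resume.
printf '%s\n' "1001 R" "1002 S" "1003 S" "1004 PD" | awk '{if ($2 =="S"){print $1}}'
```

This should print only <code>1002</code> and <code>1003</code>.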
=Slurm Tips for vg=

This page explains how to set up a development environment for [https://github.com/vgteam/vg vg] on the Phoenix cluster.

==Setting Up==

1. After connecting to the VPN, connect to an interactive node:
 ssh razzmatazz.prism
This node is relatively small, so you shouldn't run real work on it, but it is the place you need to be to submit Slurm jobs.

2. Make yourself a user directory under '''/private/groups''', which is where large data must be stored. For example, if you are in the Paten lab:
 mkdir /private/groups/patenlab/$USER

3. (Optional) Link it over to your home directory, so it is easy to use that storage to store your repos. The '''/private/groups''' storage may be faster than the home directory storage.
 mkdir -p /private/groups/patenlab/$USER/workspace
 ln -s /private/groups/patenlab/$USER/workspace ~/workspace

4. Make sure you have SSH keys created and add them to Github.
 cat ~/.ssh/id_ed25519.pub || (ssh-keygen -t ed25519 && cat ~/.ssh/id_ed25519.pub)
 # Paste into https://github.com/settings/ssh/new

5. Make a place to put your clone, and clone vg:
 mkdir -p ~/workspace
 cd ~/workspace
 git clone --recursive git@github.com:vgteam/vg.git
 cd vg

6. vg's dependencies should already be installed on the cluster nodes. If any of them seem to be missing, ask cluster-admin@soe.ucsc.edu to install them.

7. Build vg as a Slurm job. This will send the build out to the cluster as a 64-core, 80G-memory job, and keep the output logs in your terminal.
 srun -c 64 --mem=80G --time=00:30:00 make -j64

This will leave your vg binary at '''~/workspace/vg/bin/vg'''.
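If you would rather not keep a terminal attached during the build, the same srun invocation can be written as a batch script (a sketch; the script name is made up, the resource numbers are the ones from the srun command above):

```shell
#!/bin/bash
# build_vg.sbatch (hypothetical name) -- submit with: sbatch build_vg.sbatch
#SBATCH --cpus-per-task=64
#SBATCH --mem=80G
#SBATCH --time=00:30:00
cd ~/workspace/vg
make -j64
```

You then watch the slurm-<jobid>.out log instead of your terminal.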
==Misc Tips==

* For a lightweight job that outputs to your terminal, or that can be waited for in a Bash script, run an individual command directly from <code>srun</code>:
 srun -c1 --mem 2G --partition short --time 1:00:00 sleep 10
* If you need to run a few commands in the same shell, use <code>sbatch --wrap</code>:
 sbatch -c1 --mem 2G --partition short --time 1:00:00 --wrap ". venv/bin/activate; ./script1.py && ./script2.py"
* To watch a batch job's output live, look at the <code>Submitted batch job 5244464</code> line from <code>sbatch</code> and run:
 tail -f slurm-5244464.out
* '''Danger!''' If you ''really'' need an interactive session with appreciable resources, you can schedule one with <code>srun --pty</code>. But it is '''very easy''' to waste resources this way, since the job will happily sit there not doing anything until it hits the timeout. Only do this for testing! For real work, use one of the other methods!
 srun -c 16 --mem 120G --time=08:00:00 --partition=medium --pty bash -i
* To send out a job without making a script file for it, use '''sbatch --wrap "your command here"'''.
* You can use arguments from SBATCH lines on the command line!
* You can use [https://github.com/CLIP-HPC/SlurmCommander#readme Slurm Commander] to watch the state of the cluster with the '''scom''' command.
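The <code>Submitted batch job 5244464</code> line is easy to parse if you want a script to pick up the log file automatically. A minimal sketch, with the sbatch output simulated by a literal string so it runs anywhere:

```shell
# Normally you would capture real output with: out=$(sbatch my_job.sh)
out="Submitted batch job 5244464"   # simulated sbatch output
jobid="${out##* }"                  # strip everything up to the last space
echo "slurm-${jobid}.out"           # the log file name to tail -f
```

This prints <code>slurm-5244464.out</code>.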
=AWS Account List and Numbers=

This is a list of our currently available AWS accounts and their account numbers:

 ucsc-bd2k : 862902209576
 ucsc-toil-dev : 318423852362
 ucsc-vg-dev : 781907127277
 ucsc-platform-dev : 719818754276
 comparative-genomics-dev : 162786355865
 nanopore-dev : 270442831226
 ucsc-cgp-production : 097093801910 (Removed)
 platform-hca-dev : 122796619775
 anvil-dev : 608666466534 (Removed)
 gi-gateway : 652235167018
 pangenomics : 422448306679
 braingeneers : 443872533066
 ucsctreehouse : 238605363322
 ucsc-bisti-dev : 851631505710 (Removed)
 ucsc-genome-browser : 784962239183
 dockstore-dev : 635220370222
 ucsc-spatial : 541180793903 (Removed)
 platform-hca-prod : 542754589326
 platform-hca-portal : 158963592881
 miga-lab : 156518225147
 platform-anvil-dev : 289950828509
 platform-anvil-prod : 465330168186
 platform-anvil-portal : 166384485414
 platform-temp-dev : 654654270592
 agc-runs : 598929688444
 sequencing-center-cold-store : 436140841220
 hprc-training : 654654365441

=TFL=

__TOC__

Here are the instructions to prepare! Let me know if anyone has any issues; we can work through them. Make sure you have Google Chrome installed before doing these steps.

== Download the Bot Script that I Emailed You ==

It should be called '''reserve_tfl_bot.py'''. Save it to your Desktop (located at '''/Users/your_user_name/Desktop''').
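Before going further, you can optionally check from the terminal that the Python tools the later steps rely on are present (this check is an addition, not part of the original instructions):

```shell
# Print one status line each for python3 and pip3
for tool in python3 pip3; do
  if command -v "$tool" >/dev/null; then echo "$tool: ok"; else echo "$tool: missing"; fi
done
```

If either line says "missing", install Python 3 before continuing.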
== Find the Terminal Application ==

Find the "Terminal" application on your Mac. Every Mac has it. You can search for it in Spotlight (the little magnifying glass at the top right of every Mac Desktop screen); it's also in your Applications folder, in the "Utilities" subfolder. Once you find it, drag the application icon to your Dock to make a shortcut for it.

[[File:Terminal.png|900px]]

== Prepare the '''reserve_tfl_bot.py''' Script ==

Once you have the Terminal open, it will look like a white box waiting for you to type in text. This is what is called a "UNIX command line". There are only a couple of things you will need to type in here to prepare.

First, navigate to the Desktop. This changes your current working directory in the terminal to your Desktop:

 cd Desktop

Then run this command, which makes the script '''executable''' so we can run it as a program:

 chmod 755 ./reserve_tfl_bot.py

Neither command prints any feedback; if they succeed, they simply return without errors.

[[File:Term2.png|900px]]

== Install Selenium ==

The next thing you need to type in is:

 pip3 install selenium

This installs the Python module that lets the script control a Google Chrome window. It should run without any errors and look something like this:

[[File:Selenium.png|900px]]

== Run the Bot! ==

If you have any Terminal windows open from before, close them and open a new one. Once you open a new Terminal window, you can test your bot code by doing:

 cd Desktop
 python3 ./reserve_tfl_bot.py

It will print a lot of text to your terminal window, but the main thing is that it should '''also''' open a Google Chrome window on the TFL reservation page, quickly check the available dates and times, and, if it doesn't find any, quickly close the Chrome window, open a new one, and repeat until it gets a hit.
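The check-and-retry behavior described above is a simple polling loop. As a rough sketch of the pattern (shell here just for illustration; `check_once` is a hypothetical stand-in for one "open Chrome, look for a slot, close Chrome" pass of the real Python bot):

```shell
# Polling-loop sketch. check_once is a hypothetical stand-in for one
# "open Chrome, look for an open slot, close Chrome" pass of the bot.
ATTEMPTS=0
check_once() {
  ATTEMPTS=$((ATTEMPTS + 1))
  # Pretend the third pass finds an availability.
  [ "$ATTEMPTS" -ge 3 ]
}

until check_once; do
  :   # the real bot immediately re-opens Chrome here and tries again
done
echo "Hit found after $ATTEMPTS passes"
```

The loop runs until one pass succeeds, which is why hitting Control-C is the way to stop it: there is no built-in exit otherwise.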
If it finds a time, it will stop and wait for you to enter name and contact information, credit card number, etc. If you get to that point, I will send you my credit card info immediately so you can enter it. If you need to get it to stop checking, which you will need to do to stop testing, or if we "win" and get a reservation, just click on the terminal window and hit "Control-C" a few times; it should stop.

When the time comes, we will all run this step about 30 seconds before 10:00am on November 1st. Then we wait and cross our fingers!

= Using Docker under Slurm =

__TOC__

Sometimes it is convenient to ask Slurm to run your job in a Docker container. This is just fine; however, you will need to fully test your job in a Docker container beforehand (on mustard or emerald, for example) to see how much RAM and CPU it requires, so you can accurately describe in your Slurm job submission file how many resources it needs.

== Testing ==

You can run your container on mustard and then look at '''top''' to see how much RAM and CPU it needs. Be aware that you will need to pull your Docker image from a registry, like DockerHub or Quay. You should also run your Docker container with the '''--rm''' flag, so the container cleans itself up after running.
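Once testing gives you the numbers, the Slurm job submission file mentioned above might look something like this. This is only a sketch: the job name, image, and resource figures are placeholders, not values from this page:

```shell
#!/bin/bash
# Sketch of a Slurm batch script for a Docker job; all values are placeholders.
#SBATCH --job-name=docker-test
#SBATCH --cpus-per-task=16   # match what you measured with 'top' on mustard
#SBATCH --mem=1024M          # likewise

# Run with --rm so the container cleans itself up, and repeat the same
# limits to Docker itself, since Slurm cannot enforce them on the container.
docker run --rm --cpus=16 --memory=1024m docker/welcome-to-docker
```

You would submit this with '''sbatch'''; note that the Docker-side limits are repeated on the <code>docker run</code> line, for the reason explained in the Resource Limits section below.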
So your workflow would look something like this:

 1: Pull image from DockerHub
 2: docker run --rm docker/welcome-to-docker

Optionally you can clean up your image as well, but only if you don't have many jobs using that image on the same node. For example, to remove the image labelled "weiler/mytools":

 $ docker image ls
 REPOSITORY       TAG     IMAGE ID      CREATED       SIZE
 weiler/mytools   latest  be6777ad00cf  19 hours ago  396MB
 somedude/tools   latest  9b1d1f6fbf6f  3 weeks ago   607MB
 $ docker image rm be6777ad00cf

== Resource Limits ==

When running Docker containers under Slurm, Slurm cannot limit the resources that Docker uses. Therefore, when you launch a container, you need to know beforehand, from your testing, how many resources (RAM, CPU) it uses. Then launch your job with the following --cpus and --memory parameters so Docker itself will limit what it uses:

 docker run --rm '''--cpus=16 --memory=1024m''' docker/welcome-to-docker

The --memory argument is in megabytes (hence the 'm' at the end), so the above example sets a memory limit of 1GB.

== Docker and GPUs ==

If you are using GPUs with Docker, you need to make sure that your Docker container requests access to the ''correct'' GPUs: the ones which Slurm assigned to your job. These will be passed in the <code>SLURM_STEP_GPUS</code> (for GPUs for a single step) or <code>SLURM_JOB_GPUS</code> (for GPUs for a whole job) environment variables. They need to be passed to Docker like this:

 docker run --gpus="\"device=${SLURM_STEP_GPUS:-$SLURM_JOB_GPUS}\"" nvidia/cuda nvidia-smi

'''Note the escaped quotes'''; the Docker command needs to have double-quotes ''inside'' the argument value. The <code>${:-}</code> syntax will use <code>SLURM_STEP_GPUS</code> if it is set and <code>SLURM_JOB_GPUS</code> if it isn't; if you know which will be set for your job, you can use just that one.

If you are using Nextflow, you will need to set <code>docker.runOptions</code> to include this flag.
 docker.runOptions="--gpus \\\"device=$SLURM_JOB_GPUS\\\""

If you are using Toil to run CWL or WDL, the correct GPUs will be passed to containers automatically.

== Cleaning Scripts ==

We have auto-cleaning scripts running that will delete any containers and images that were created or pulled more than 7 days ago. This applies to the cluster nodes and also to the phoenix head node itself. If you need a place where your images/containers can remain longer than that, please put them on mustard, emerald, crimson or razzmatazz.

There are also cleaning scripts in place that will destroy any container that has been running for over 7 days. We assume that such a container was not launched with '''--rm''' and needs to be cleaned up.

= Requirements for dbGaP Access =

If you need NIH dbGaP access, there are several requirements to gaining access - please complete all of them '''BEFORE''' requesting dbGaP credentials.

NOTE: If you already have GI VPN access to the GI "Prism" environment, then you have already completed the requirements detailed below - let the GI Cluster Admin Group (cluster-admin@soe.ucsc.edu) know and we can quickly move to getting you set up.

Please use this checklist to make sure that you have completed all '''three''' requirements:

'''1'''. Your PI's info and your PI's approval

'''2'''. NIH Public Security Refresher Course Certificate

'''3'''. Signed NIH Genomic Data Sharing Policy Agreement

'''1''': You are required to ask your PI or sponsor to email '''cluster-admin@soe.ucsc.edu''' requesting dbGaP access for you. This email should include:

* Your name
* Your PI's name
* PI's approval for this access

'''2''': You must take the NIH Public Security Refresher Course online, then print out the Completion Certificate (which should have your name on it) at the end of the training and deliver it to the GI Grants Team.
You must complete the course in a single continuous sitting in order to be able to print the certificate at the end:

 https://irtsectraining.nih.gov/publicUser.aspx

Click on the "2020 Information Security, Counterintelligence, Privacy Awareness, Records Management Refresher" link to begin the course. At the end you will be able to print out the completion certificate, which should have your name on it.

'''3''': Please print and read the entire NIH Genomic Data Sharing Policy agreement (linked below for download), sign the last page, then scan and email the executed document to cluster-admin@soe.ucsc.edu with a subject line that includes: NIH GDS document. By signing the document you agree that you have read and understood the policies described therein and that you will abide by them:

[[Media:NIH_GDS_Policy.pdf]]

= Resetting your VPN/PRISM Password =

If you have forgotten your VPN password (which is also your PRISM UNIX password), send an email to '''cluster-admin@soe.ucsc.edu''' requesting that your password be reset (include your username in the request). Once we have sent you your new temporary password, you will need to:

1: Log into the PRISM VPN using this new temporary password.

2: Log into one of the servers behind the firewall (mustard, emerald, crimson or razzmatazz) using your new temporary password.

3: Once you log in there, it should ask you to type in your temporary password one more time, then ask you to choose a new password. If it does not ask you to change your password, use the '''passwd''' command to change it. Once you choose a new password (and type it twice for confirmation), log out of your SSH session.

'''NOTE:''' Your new password must be at least 10 characters long, using three or more character classes (lowercase, uppercase, number or special character).

4: Log out (disconnect) from the VPN. '''This step is very important!'''

5: Log back into the VPN using the '''new''' password that you chose in step 3.

6: Log back into one of the servers (mustard, emerald, crimson or razzmatazz) using your new password.

Assuming all that works, your password has been reset. Note that you cannot reuse any of the prior five passwords you have used for your account.

= Requirement for users to get GI VPN access =

Before you are allowed access to our firewalled/secure area ("Prism"), you have to complete 3 items and provide the completed certificates or forms:

'''1''': You must take and complete the NIH Public Security Refresher Course online, in a single continuous sitting:

 https://irtsectraining.nih.gov/public.aspx

Click on "Enter Public Training Portal" near the bottom of the page. The course is titled "2022 Information Security, Insider Threats, Privacy Awareness, Records Management and Emergency Preparedness Refresher". At the end you will be able to save the completion certificate, which should have your name on it.

'''2''': You need to sign the Genomics Institute VPN User Agreement (digital signature OK), located here for download:

[[Media:GI_VPN_Policy.pdf]]

'''3''': Please read and sign the last page of the NIH Genomic Data Sharing Policy agreement (digital signature OK), located here for download.
By signing the document you agree that you have read and understand the policies described therein and that you agree to abide by those policies: [[Media:NIH_GDS_Policy.pdf]] When you have the three documents described above ready, please complete this form: https://app.smartsheet.com/b/form/a76dbd90ba0240ab9ea9d39b390586ce. There are two parts in this process. 1. For the user, please fill in ALL required fields '''and attach''' all three required documents described above. 2. For the Sponsor/PI - you will receive an email from Smartsheets. Please fill in all required fields and submit. The form then goes to your PI for approval - remind them to approve it, or it won't get sent to us for processing! We will receive your completed request and we will create your account and go over the details via a short zoom meeting with you. You will need access to the "eduroam" wireless network '''prior''' to your zoom appointment if you are on campus. Other UCSC wireless networks such as "cruznet" will not work with our VPN software, so please make sure your laptop works with eduroam before coming to your appointment. Instructions on how to get on eduroam are detailed here: https://its.ucsc.edu/wireless/eduroam.html When using the VPN software off-campus, it will usually work unless the wireless network you are on has restrictions preventing it from functioning. Some other universities have such restrictions (notably UCSF), but most other wireless network and home wireless networks should work fine. Before your appointment, please make sure you install the appropriate OpenVPN software on your laptop: A laptop running OS X, Windows or Ubuntu For Macs, please download and install '''Tunnelblick''' from https://giwiki.gi.ucsc.edu/downloads/Tunnelblick.dmg - we will send you the username and password for the website via email. For Windows, please download and install '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. 
Select ''openvpn-install-x.x.x-xxxx.exe'' For Ubuntu, please install network-manager-openvpn by typing: sudo apt-get install network-manager-openvpn network-manager-openvpn-gnome Please do NOT worry about how to configure the software at this point. We will help you to set it up at your appointment. We will correspond with you via email on when the appointment will be. The zoom meeting can take up to 30 minutes per person depending on whether or not any issues come up during the software setup. If you show up for your appointment without one (or more) of the above outlined requirements, we will have to reschedule your appointment for a time when you can arrive after completing the above requirements. '''PLEASE NOTE:''' Because of the overhead required in setting up VPN access, please only request access if you have an immediate need to work on data that exists behind the firewall. We have had a decent number of people request access and go through the setup but then never use it. In other words, please do not request access because "one day you might need it", but because you '''do''' actually need it! '''ALSO NOTE:''' VPN accounts expire after one year from the date of first gaining access. To renew for another year you will need your PI/sponsor to send us a note asking for renewal. a8870673d344c8a3a7805cbb9b7e1a45eb99a8b0 556 555 2025-02-08T15:20:49Z Weiler 3 wikitext text/x-wiki Before you are allowed access to our firewalled/secure area ("Prism"), you have to complete 3 items and provide the completed certificates or forms: '''1''': You must take and complete the NIH Public Security Refresher Course online. You must complete the course in a single continuous sitting: https://irtsectraining.nih.gov/public.aspx Click on the "Enter Public Training Portal" near the bottom of the page. The course is titled "2022 Information Security, Insider Threats, Privacy Awareness, Records Management and Emergency Preparedness Refresher". 
At the end you will be able to save the completion certificate that should have your name on it. '''2''': You need to sign the Genomics Institute VPN User Agreement (digital signature OK), located here for download: [[Media:GI_VPN_Policy.pdf]] '''3''': Please read and sign the last page of the NIH Genomic Data Sharing Policy agreement (digital signature OK), located here for download. By signing the document you agree that you have read and understand the policies described therein and that you agree to abide by those policies: [[Media:NIH_GDS_Policy.pdf]] When you have the three documents described above ready, please complete this form: https://app.smartsheet.com/b/form/a76dbd90ba0240ab9ea9d39b390586ce. There are two parts in this process. 1. For the user, please fill in ALL required fields '''and attach''' all three required documents described above. 2. For the Sponsor/PI - you will receive an email from Smartsheets. Please fill in all required fields and submit. The form then goes to your PI for approval - remind them to approve it, or it won't get sent to us for processing! We will receive your completed request and we will create your account and go over the details via a short zoom meeting with you. You will need access to the "eduroam" wireless network '''prior''' to your zoom appointment if you are on campus. Other UCSC wireless networks such as "cruznet" will not work with our VPN software, so please make sure your laptop works with eduroam before coming to your appointment. Instructions on how to get on eduroam are detailed here: https://its.ucsc.edu/wireless/eduroam.html When using the VPN software off-campus, it will usually work unless the wireless network you are on has restrictions preventing it from functioning. Some other universities have such restrictions (notably UCSF), but most other wireless network and home wireless networks should work fine. 
Before your appointment, please make sure you install the appropriate OpenVPN software on your laptop: A laptop running OS X, Windows or Ubuntu For Macs, please download and install '''Tunnelblick''' from https://giwiki.gi.ucsc.edu/downloads/Tunnelblick.dmg - we will send you the username and password for the website via email. This VPN client is pre-packaged by us and will install fully configured and ready to go. For Windows, please download and install '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. Select ''openvpn-install-x.x.x-xxxx.exe'' For Ubuntu, please install network-manager-openvpn by typing: sudo apt-get install network-manager-openvpn network-manager-openvpn-gnome Please do NOT worry about how to configure the software at this point. We will help you to set it up at your appointment. We will correspond with you via email on when the appointment will be. The zoom meeting can take up to 30 minutes per person depending on whether or not any issues come up during the software setup. If you show up for your appointment without one (or more) of the above outlined requirements, we will have to reschedule your appointment for a time when you can arrive after completing the above requirements. '''PLEASE NOTE:''' Because of the overhead required in setting up VPN access, please only request access if you have an immediate need to work on data that exists behind the firewall. We have had a decent number of people request access and go through the setup but then never use it. In other words, please do not request access because "one day you might need it", but because you '''do''' actually need it! '''ALSO NOTE:''' VPN accounts expire after one year from the date of first gaining access. To renew for another year you will need your PI/sponsor to send us a note asking for renewal. 
b294b737ee82d986d237a5b0d19ed95a5ba07477 557 556 2025-02-08T15:21:58Z Weiler 3 wikitext text/x-wiki Before you are allowed access to our firewalled/secure area ("Prism"), you have to complete 3 items and provide the completed certificates or forms: '''1''': You must take and complete the NIH Public Security Refresher Course online. You must complete the course in a single continuous sitting: https://irtsectraining.nih.gov/public.aspx Click on the "Enter Public Training Portal" near the bottom of the page. The course is titled "2022 Information Security, Insider Threats, Privacy Awareness, Records Management and Emergency Preparedness Refresher". At the end you will be able to save the completion certificate that should have your name on it. '''2''': You need to sign the Genomics Institute VPN User Agreement (digital signature OK), located here for download: [[Media:GI_VPN_Policy.pdf]] '''3''': Please read and sign the last page of the NIH Genomic Data Sharing Policy agreement (digital signature OK), located here for download. By signing the document you agree that you have read and understand the policies described therein and that you agree to abide by those policies: [[Media:NIH_GDS_Policy.pdf]] When you have the three documents described above ready, please complete this form: https://app.smartsheet.com/b/form/a76dbd90ba0240ab9ea9d39b390586ce. There are two parts to this process. 1. For the user, please fill in ALL required fields '''and attach''' all three required documents described above. The form then goes to your PI for approval - remind them to approve it, or it won't get sent to us for processing! 2. For the Sponsor/PI - you will receive an email from Smartsheet. Please fill in all required fields and submit. Once we receive your completed request, we will create your account and go over the details via a short Zoom meeting with you. You will need access to the "eduroam" wireless network '''prior''' to your Zoom appointment if you are on campus.
Other UCSC wireless networks such as "cruznet" will not work with our VPN software, so please make sure your laptop works with eduroam before coming to your appointment. Instructions on how to get on eduroam are detailed here: https://its.ucsc.edu/wireless/eduroam.html When using the VPN software off-campus, it will usually work unless the wireless network you are on has restrictions preventing it from functioning. Some other universities have such restrictions (notably UCSF), but most other wireless networks and home networks should work fine. Before your appointment, please make sure you install the appropriate OpenVPN software on your laptop. You will need a laptop running OS X, Windows or Ubuntu. For Macs, please download and install '''Tunnelblick''' from https://giwiki.gi.ucsc.edu/downloads/Tunnelblick.dmg - we will send you the username and password for the website via email. This VPN client is pre-packaged by us and will install fully configured and ready to go. For Windows, please download and install '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. Select ''openvpn-install-x.x.x-xxxx.exe'' For Ubuntu, please install network-manager-openvpn by typing: sudo apt-get install network-manager-openvpn network-manager-openvpn-gnome Please do NOT worry about how to configure the software at this point. We will help you set it up at your appointment. We will correspond with you via email to schedule the appointment. The Zoom meeting can take up to 30 minutes per person depending on whether or not any issues come up during the software setup. If you show up for your appointment without one (or more) of the above outlined requirements, we will have to reschedule your appointment for after you have completed them. '''PLEASE NOTE:''' Because of the overhead required in setting up VPN access, please only request access if you have an immediate need to work on data that exists behind the firewall.
We have had a decent number of people request access and go through the setup but then never use it. In other words, please do not request access because "one day you might need it", but because you '''do''' actually need it! '''ALSO NOTE:''' VPN accounts expire after one year from the date of first gaining access. To renew for another year you will need your PI/sponsor to send us a note asking for renewal. 18454863b7258af62dae4a91d90d577339c52b71 577 557 2025-02-09T16:06:40Z Weiler 3 wikitext text/x-wiki Before you are allowed access to our firewalled/secure area ("Prism"), you have to complete 3 items and provide the completed certificates or forms: '''1''': You must take and complete the NIH Public Security Refresher Course online. You must complete the course in a single continuous sitting: https://irtsectraining.nih.gov/public.aspx Click on the "Enter Public Training Portal" near the bottom of the page. The course is titled "2022 Information Security, Insider Threats, Privacy Awareness, Records Management and Emergency Preparedness Refresher". At the end you will be able to save the completion certificate that should have your name on it. '''2''': You need to sign the Genomics Institute VPN User Agreement (digital signature OK), located here for download: [[Media:GI_VPN_Policy.pdf]] '''3''': Please read and sign the last page of the NIH Genomic Data Sharing Policy agreement (digital signature OK), located here for download. By signing the document you agree that you have read and understand the policies described therein and that you agree to abide by those policies: [[Media:NIH_GDS_Policy.pdf]] When you have the three documents described above ready, please complete this form: https://app.smartsheet.com/b/form/a76dbd90ba0240ab9ea9d39b390586ce. There are two parts to this process. 1. For the user, please fill in ALL required fields '''and attach''' all three required documents described above.
The form then goes to your PI for approval - remind them to approve it, or it won't get sent to us for processing! 2. For the Sponsor/PI - you will receive an email from Smartsheet. Please fill in all required fields and submit. Once we receive your completed request, we will create your account, then you will receive a welcome email with instructions on how to configure your VPN and gain access to our systems. When using the VPN software off-campus, it will usually work unless the wireless network you are on has restrictions preventing it from functioning. Some other universities have such restrictions (notably UCSF), but most other wireless networks and home networks should work fine. For Windows, please download and install '''OpenVPN Client''' from https://openvpn.net/index.php/open-source/downloads.html. Select ''openvpn-install-x.x.x-xxxx.exe'' For Ubuntu, please install network-manager-openvpn by typing: sudo apt-get install network-manager-openvpn network-manager-openvpn-gnome '''PLEASE NOTE:''' Because of the overhead required in setting up VPN access, please only request access if you have an immediate need to work on data that exists behind the firewall. We have had a decent number of people request access and go through the setup but then never use it. In other words, please do not request access because "one day you might need it", but because you '''do''' actually need it! '''ALSO NOTE:''' VPN accounts typically expire after one year from the date of first gaining access. To renew for another year you will need your PI/sponsor to send us a note asking for renewal. 522e3d1e6d466eece613657957555c954a9d004e 597 577 2025-02-09T17:17:39Z Weiler 3 wikitext text/x-wiki Before you are allowed access to our firewalled/secure area ("Prism"), you have to complete 3 items and provide the completed certificates or forms: '''1''': You must take and complete the NIH Public Security Refresher Course online.
You must complete the course in a single continuous sitting: https://irtsectraining.nih.gov/public.aspx Click on the "Enter Public Training Portal" near the bottom of the page. The course is titled "2022 Information Security, Insider Threats, Privacy Awareness, Records Management and Emergency Preparedness Refresher". At the end you will be able to save the completion certificate that should have your name on it. '''2''': You need to sign the Genomics Institute VPN User Agreement (digital signature OK), located here for download: [[Media:GI_VPN_Policy.pdf]] '''3''': Please read and sign the last page of the NIH Genomic Data Sharing Policy agreement (digital signature OK), located here for download. By signing the document you agree that you have read and understand the policies described therein and that you agree to abide by those policies: [[Media:NIH_GDS_Policy.pdf]] When you have the three documents described above ready, please complete this form: https://app.smartsheet.com/b/form/a76dbd90ba0240ab9ea9d39b390586ce. There are two parts to this process. 1. For the user, please fill in ALL required fields '''and attach''' all three required documents described above. The form then goes to your PI for approval - remind them to approve it, or it won't get sent to us for processing! 2. For the Sponsor/PI - you will receive an email from Smartsheet. Please fill in all required fields and submit. Once we receive your completed request, we will create your account, then you will receive a welcome email with instructions on how to configure your VPN client and gain access to our systems. When using the VPN software off-campus, it will usually work unless the wireless network you are on has restrictions preventing it from functioning. Some other universities have such restrictions (notably UCSF), but most other wireless networks and home networks should work fine.
'''PLEASE NOTE:''' Because of the overhead required in setting up VPN access, please only request access if you have an immediate need to work on data that exists behind the firewall. We have had a decent number of people request access and go through the setup but then never use it. In other words, please do not request access because "one day you might need it", but because you '''do''' actually need it! '''ALSO NOTE:''' VPN accounts typically expire after one year from the date of first gaining access. To renew for another year you will need your PI/sponsor to send us a note asking for renewal. e5ef1754dcb20f95225e7109f708feeec0e12c3b How to access the public servers 0 11 558 429 2025-02-08T16:14:36Z Weiler 3 wikitext text/x-wiki == How to Gain Access to the Public Genomics Institute Compute Servers == If you need access to the Genomics Institute compute servers please complete this request form: https://app.smartsheet.com/b/form/a76dbd90ba0240ab9ea9d39b390586ce. There are two parts to this process. 1. For the user, please fill in ALL required fields and submit. The form will get sent to your PI for approval. Remind them to approve it, or it won't get sent to the systems group for processing. 2. For the Sponsor/PI - you will receive an email from Smartsheet. Please fill in all required fields and submit. Once we receive your completed request, we will create your account and go over the details via a short Zoom meeting with you. == Account and Storage Cost == Costs for having an active UNIX account and for storage (per TB) are listed in this document, under "Genomics Project Support", specifically "Genomics IT Systems User Support per user" and "Genomics Data Storage per TB": https://planning.ucsc.edu/budget/rates-and-assessments/recharge-rates/docs/2021-22-approved-recharge-rates.pdf == Account Expiration == Your UNIX account will have an expiration date associated with it after creation, as requested by your sponsor.
Please take note of this expiration date when your account is created. You will receive notice by email when your account is about to expire. To renew, simply ask the PI that sponsored you (who will be named in the notice) to email '''cluster-admin@soe.ucsc.edu''' requesting that your account be renewed for another year, or any other requested amount of time. If your account expires, the account will be suspended and you will no longer be able to log in or view any data you may have in our systems. Any automated scripts (owned by you) that run via cron or other mechanisms will cease to function. == Server Types and Management== You can log into our public compute servers via SSH: '''courtyard.gi.ucsc.edu''': 1TB RAM, 64 cores, 672GB local scratch space, Ubuntu 22.04.2 '''park.gi.ucsc.edu''': 256GB RAM, 32 cores, 5TB local scratch space, Ubuntu 22.04.2 These servers are managed by the Genomics Institute Cluster Admin group. If you need software installed on them, please make your request by emailing cluster-admin@soe.ucsc.edu. == Storage == These servers mount two types of storage: home directories and group storage directories. Your home directory will be located at "/public/home/username" and has a 30GB quota. The group storage directories are created per PI, and each group directory has a default 15TB quota (although in some cases the quota is higher). For example, if David Haussler is the PI that you report to directly, then the directory would exist as /public/groups/hausslerlab. Request access to that group directory and you will then be able to write to it. Each group directory is shared by the lab it belongs to, so be mindful of everyone's data usage and share the 15TB available per group accordingly. On the compute servers you can check your group's current quota usage by using the '/usr/bin/viewquota' command. You can only check the quota of a group you are part of (you would be a member of the UNIX group of the same name).
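As a convenience, you can give the two servers above short aliases in your SSH client configuration. This is only a sketch: ''your_username'' is a placeholder for your own account name, and the entries go in ~/.ssh/config on your laptop.

```
Host courtyard
    HostName courtyard.gi.ucsc.edu
    User your_username

Host park
    HostName park.gi.ucsc.edu
    User your_username
```

With these entries in place, 'ssh courtyard' or 'ssh park' is enough to connect.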
If you wanted to check the quota usage of /public/groups/hausslerlab for example, you would do:

 $ viewquota hausslerlab
 Project quota on /export (/dev/mapper/export)
 Project ID   Used  Soft  Hard  Warn/Grace
 ---------- ---------------------------------
 hausslerlab  1.8T   15T   16T  00 [------]

== Actually Doing Work and Computing == When doing research, running jobs and the like, please be careful of your resource consumption on the server you are on. Don't run so many threads or processes at once that they exhaust the available RAM or disk I/O. If you are not sure of your potential RAM, CPU or disk impact, start small with one or two processes and work your way up from there. Also, before running your jobs, check what is already happening on the server with the 'top' command to see who is running what and which resources are already being consumed. If, after starting a process, you realize that the server slows down considerably or becomes unusable, kill your processes and re-evaluate what you need to make things work. These servers are shared resources - be a good neighbor! == Serving Files to the Public via the Web == If you want to set up a web page on courtyard, or serve files over HTTP from there, do this: mkdir /public/home/''your_username''/public_html chmod 755 /public/home/''your_username''/public_html Put data in the public_html directory. The URL will be: http://public.gi.ucsc.edu/''~username''/ == /data/scratch Space on the Servers == Each server will generally have a local /data/scratch filesystem that you can use to store temporary files. '''BE ADVISED''' that /data/scratch is not backed up, and the data there could disappear in the event of a disk failure or anything else. Do not store important data there. If it is important, it should be moved somewhere else very soon after creation.
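The public_html steps above can be sketched as a short shell snippet. This is a sketch, not an official setup script: the /public/home path and the public.gi.ucsc.edu URL come from this page, and the snippet falls back to $HOME (an assumption for illustration) so it can be tried on any machine.

```shell
# Create a world-readable public_html directory, as described above.
# On courtyard your home directory is /public/home/<username>; we fall
# back to $HOME here so the sketch runs anywhere (an assumption for
# illustration, not part of the official instructions).
WEBROOT="${WEBROOT:-$HOME}/public_html"

mkdir -p "$WEBROOT"    # create the directory (no-op if it already exists)
chmod 755 "$WEBROOT"   # rwxr-xr-x so the web server can read and list it

# Files placed here become visible at http://public.gi.ucsc.edu/~<username>/
ls -ld "$WEBROOT"
```

Note that the 755 mode matters: if the directory is not world-readable and world-executable, the web server cannot serve its contents.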
a7deb4be75f3c9016cd3b4d69b12e7f13e3e1c95 Genomics Institute Computing Information 0 6 559 542 2025-02-08T19:31:55Z Weiler 3 /* VPN Access */ wikitext text/x-wiki Welcome to the Genomics Institute Computing Information Repository! Browse the topics below for help in the area you are curious about. == GI Public Computing Environment == *[[How to access the public servers]] == GI Firewalled Computing Environment (PRISM) == *[[Access to the Firewalled Compute Servers]] *[[Firewalled Computing Resources Overview]] *[[Firewalled Environment Storage Overview]] *[[Firewalled User Account and Storage Cost]] *[[Grafana Performance Metrics]] *[[Visual Studio Code (vscode) Configuration Tweaks]] *[http://logserv.gi.ucsc.edu/cgi-bin/private-groups.cgi '''/private/groups''' Data Usage Graphs] *[[Resetting your VPN/PRISM Password]] ==VPN Access== *[[Requirement for users to get GI VPN access]] *[[Setting Up The VPN on a Mac]] == NIH dbGaP Access Requirements == *[[Requirements for dbGaP Access]] == giCloud Openstack == *[[Overview of giCloud in the Genomics Institute]] *[[Quick Start Instructions to Get Rolling with OpenStack]] == Amazon Web Services Information == *[[Overview of Getting and Using an AWS IAM Account]] *[[AWS Account List and Numbers]] *[[AWS Shared Bucket Usage Graphs]] *[[AWS Best Practices]] *[[AWS S3 Lifecycle Management]] == Slurm at the Genomics Institute == *[[Overview of using Slurm]] *[[Cluster Etiquette]] *[[Annotated Slurm Script]] *[[Job Arrays]] *[[GPU Resources]] *[[Quick Reference Guide]] *[[Convenient Slurm Commands]] *[[Slurm Queues (Partitions) and Resource Management]] *[[Slurm Tips for vg]] *[[Slurm Tips for Toil]] *[[Using Docker under Slurm]] *[[Phoenix WDL Tutorial]] ==General Docker Information== *[[Running a Container as a non-root User]] == Problems or technical support == If you have any problems with the GI computing environment, please send an email to '''cluster-admin@soe.ucsc.edu''' 873c10870b3096bb1872bccdb3f41c74b924ac60 560 559
2025-02-09T00:18:01Z Weiler 3 /* VPN Access */ wikitext text/x-wiki Welcome to the Genomics Institute Computing Information Repository! Browse the topics below for help in the area you are curious about. == GI Public Computing Environment == *[[How to access the public servers]] == GI Firewalled Computing Environment (PRISM) == *[[Access to the Firewalled Compute Servers]] *[[Firewalled Computing Resources Overview]] *[[Firewalled Environment Storage Overview]] *[[Firewalled User Account and Storage Cost]] *[[Grafana Performance Metrics]] *[[Visual Studio Code (vscode) Configuration Tweaks]] *[http://logserv.gi.ucsc.edu/cgi-bin/private-groups.cgi '''/private/groups''' Data Usage Graphs] *[[Resetting your VPN/PRISM Password]] ==VPN Access== *[[Requirement for users to get GI VPN access]] *[[Setting Up The VPN on MacOS]] *[[Setting Up The VPN on Windows]] *[[Setting Up The VPN on Linux]] == NIH dbGaP Access Requirements == *[[Requirements for dbGaP Access]] == giCloud Openstack == *[[Overview of giCloud in the Genomics Institute]] *[[Quick Start Instructions to Get Rolling with OpenStack]] == Amazon Web Services Information == *[[Overview of Getting and Using an AWS IAM Account]] *[[AWS Account List and Numbers]] *[[AWS Shared Bucket Usage Graphs]] *[[AWS Best Practices]] *[[AWS S3 Lifecycle Management]] == Slurm at the Genomics Institute == *[[Overview of using Slurm]] *[[Cluster Etiquette]] *[[Annotated Slurm Script]] *[[Job Arrays]] *[[GPU Resources]] *[[Quick Reference Guide]] *[[Convenient Slurm Commands]] *[[Slurm Queues (Partitions) and Resource Management]] *[[Slurm Tips for vg]] *[[Slurm Tips for Toil]] *[[Using Docker under Slurm]] *[[Phoenix WDL Tutorial]] ==General Docker Information== *[[Running a Container as a non-root User]] == Problems or technical support == If you have any problems with the GI computing environment, please send an email to '''cluster-admin@soe.ucsc.edu''' 6f6df3ec8680ba9dcaada261c086c26b6e99deec 603 560 2025-02-10T17:50:23Z Weiler 3
/* VPN Access */ wikitext text/x-wiki Welcome to the Genomics Institute Computing Information Repository! Browse the topics below for help in the area you are curious about. == GI Public Computing Environment == *[[How to access the public servers]] == GI Firewalled Computing Environment (PRISM) == *[[Access to the Firewalled Compute Servers]] *[[Firewalled Computing Resources Overview]] *[[Firewalled Environment Storage Overview]] *[[Firewalled User Account and Storage Cost]] *[[Grafana Performance Metrics]] *[[Visual Studio Code (vscode) Configuration Tweaks]] *[http://logserv.gi.ucsc.edu/cgi-bin/private-groups.cgi '''/private/groups''' Data Usage Graphs] *[[Resetting your VPN/PRISM Password]] ==VPN Access== *[[Requirement for users to get GI VPN access]] *[[Setting Up The VPN on MacOS]] *[[Setting Up The VPN on Windows]] *[[Setting Up The VPN on Linux]] *[[Multi Factor Authentication (MFA) Frequently Asked Questions]] == NIH dbGaP Access Requirements == *[[Requirements for dbGaP Access]] == giCloud Openstack == *[[Overview of giCloud in the Genomics Institute]] *[[Quick Start Instructions to Get Rolling with OpenStack]] == Amazon Web Services Information == *[[Overview of Getting and Using an AWS IAM Account]] *[[AWS Account List and Numbers]] *[[AWS Shared Bucket Usage Graphs]] *[[AWS Best Practices]] *[[AWS S3 Lifecycle Management]] == Slurm at the Genomics Institute == *[[Overview of using Slurm]] *[[Cluster Etiquette]] *[[Annotated Slurm Script]] *[[Job Arrays]] *[[GPU Resources]] *[[Quick Reference Guide]] *[[Convenient Slurm Commands]] *[[Slurm Queues (Partitions) and Resource Management]] *[[Slurm Tips for vg]] *[[Slurm Tips for Toil]] *[[Using Docker under Slurm]] *[[Phoenix WDL Tutorial]] ==General Docker Information== *[[Running a Container as a non-root User]] == Problems or technical support == If you have any problems with the GI computing environment, please send an email to '''cluster-admin@soe.ucsc.edu''' 0f97ec3a9491ee90e4d0a24018fa091028ea4ee6
Setting Up The VPN on MacOS 0 61 561 2025-02-09T15:14:03Z Weiler 3 Created page with "For MacOS, you will be installing "Tunnelblick", an OpenVPN client software package for Mac. This software is pre-packaged for our systems and already includes built-in configuration. Do not install this software on public or shared computers! You will need to download Tunnelblick from this link: https://giwiki.gi.ucsc.edu/downloads/Tunnelblick.dmg The username and password to access that website will be sent to you in your account creation welcome email. Once you..." wikitext text/x-wiki For MacOS, you will be installing "Tunnelblick", an OpenVPN client software package for Mac. This software is pre-packaged for our systems and already includes built-in configuration. Do not install this software on public or shared computers! You will need to download Tunnelblick from this link: https://giwiki.gi.ucsc.edu/downloads/Tunnelblick.dmg The username and password to access that website will be sent to you in your account creation welcome email. Once you have downloaded the file, navigate in the Finder to wherever you downloaded it to and double click on it. It should open the .dmg file and you should see the Tunnelblick application icon. Double click on Tunnelblick to install. During installation, it will ask you if you want to install for "Only You" or "All Users". Select "Only You". After installation, in your Finder, you may want to navigate to the Applications folder and drag the Tunnelblick icon to your dock for easy launching. After launching the software from the Applications folder, you will see a small "tunnel" icon on the top right of your screen. You should be able to click on that icon, then click "Connect prism" to start the VPN. Use the username and temporary password to login to the VPN for the first time. After typing in your username and password, you will be sent a Duo MFA push to your phone. Accept that push, and then you will be connected. 
Once you authenticate to the VPN (username/password/MFA), then login via SSH to 'mustard.prism' for example, and you will be asked to change your password. Once you change your password, log out of mustard, then log out of the VPN (click the Tunnelblick icon on the top right of your screen and select "disconnect"). Then, log back into the VPN using your '''new''' password. It will send another Duo MFA push to your phone, then you should be logged in! 2554f148f5e21b7ae4354b15c4155bbdd5a7509d 562 561 2025-02-09T15:14:37Z Weiler 3 wikitext text/x-wiki For MacOS, you will be installing "Tunnelblick", an OpenVPN client software package for Mac. This software is pre-packaged for our systems and already includes built-in configuration. Do not install this software on public or shared computers! You will need to download Tunnelblick from this link: https://giwiki.gi.ucsc.edu/downloads/Tunnelblick.dmg The username and password to access that website will be sent to you in your account creation welcome email. Once you have downloaded the file, navigate in the Finder to wherever you downloaded it to and double click on it. It should open the .dmg file and you should see the Tunnelblick application icon. Double click on the Tunnelblick icon to install. During installation, it will ask you if you want to install for "Only You" or "All Users". Select "Only You". After installation, in your Finder, you may want to navigate to the Applications folder and drag the Tunnelblick icon to your dock for easy launching. After launching the software from the Applications folder, you will see a small "tunnel" icon on the top right of your screen. You should be able to click on that icon, then click "Connect prism" to start the VPN. Use the username and temporary password to login to the VPN for the first time. After typing in your username and password, you will be sent a Duo MFA push to your phone. Accept that push, and then you will be connected. 
Once you authenticate to the VPN (username/password/MFA), then login via SSH to 'mustard.prism' for example, and you will be asked to change your password. Once you change your password, log out of mustard, then log out of the VPN (click the Tunnelblick icon on the top right of your screen and select "disconnect"). Then, log back into the VPN using your '''new''' password. It will send another Duo MFA push to your phone, then you should be logged in! 44bdcc16c261cfed25f544358bf71ccc2a60db0f 563 562 2025-02-09T15:21:43Z Weiler 3 wikitext text/x-wiki For MacOS, you will be installing "Tunnelblick", an OpenVPN client software package for Mac. This software is pre-packaged for our systems and already includes built-in configuration. Do not install this software on public or shared computers! Before installing Tunnelblick, you must have enrolled your cell phone for Duo MFA using your CruzID account with UCSC. Most folks already have this from when they first started at UCSC. If you have a CruzID but haven't yet enrolled your cell phone, please follow the instructions here to enroll your phone: https://its.ucsc.edu/mfa/enroll.html After confirming your cell phone MFA enrollment, or if you have already done this a while ago, continue to install Tunnelblick. You will need to download Tunnelblick from this link: https://giwiki.gi.ucsc.edu/downloads/Tunnelblick.dmg The username and password to access that website will be sent to you in your account creation welcome email. Once you have downloaded the file, navigate in the Finder to wherever you downloaded it to and double click on it. It should open the .dmg file and you should see the Tunnelblick application icon. Double click on the Tunnelblick icon to install. During installation, it will ask you if you want to install for "Only You" or "All Users". Select "Only You". After installation, in your Finder, you may want to navigate to the Applications folder and drag the Tunnelblick icon to your dock for easy launching. 
After launching the software from the Applications folder, you will see a small "tunnel" icon on the top right of your screen. You should be able to click on that icon, then click "Connect prism" to start the VPN. Use the username and temporary password to login to the VPN for the first time. After typing in your username and password, you will be sent a Duo MFA push to your phone. Accept that push, and then you will be connected. Once you authenticate to the VPN (username/password/MFA), then login via SSH to 'mustard.prism' for example, and you will be asked to change your password. Once you change your password, log out of mustard, then log out of the VPN (click the Tunnelblick icon on the top right of your screen and select "disconnect"). Then, log back into the VPN using your '''new''' password. It will send another Duo MFA push to your phone, then you should be logged in! 059a9cece9149ddb31d0279fcffcc9766dab5738 564 563 2025-02-09T15:21:52Z Weiler 3 wikitext text/x-wiki For MacOS, you will be installing "Tunnelblick", an OpenVPN client software package for Mac. This software is pre-packaged for our systems and already includes built-in configuration. Do not install this software on public or shared computers! Before installing Tunnelblick, you must have enrolled your cell phone for Duo MFA using your CruzID account with UCSC. Most folks already have this from when they first started at UCSC. If you have a CruzID but haven't yet enrolled your cell phone, please follow the instructions here to enroll your phone: https://its.ucsc.edu/mfa/enroll.html After confirming your cell phone MFA enrollment, or if you have already done this a while ago, continue to install Tunnelblick. You will need to download Tunnelblick from this link: https://giwiki.gi.ucsc.edu/downloads/Tunnelblick.dmg The username and password to access that website will be sent to you in your account creation welcome email. 
'''Before''' following these instructions, please ensure that you have filled out an account request form and completed all of the training and requirements detailed here: [[Requirement_for_users_to_get_GI_VPN_access]] After completing those requirements, you should have received a welcome email from us explaining that your account is ready. Once you have received that email, continue following these instructions.

For MacOS, you will be installing "Tunnelblick", an OpenVPN client software package for Mac. This software is pre-packaged for our systems and already includes built-in configuration. Do not install this software on public or shared computers!

Before installing Tunnelblick, you must have enrolled your cell phone for Duo MFA using your CruzID account with UCSC. Most folks already have this from when they first started at UCSC. If you don't yet have a CruzID, please contact your sponsor/PI and ask them to help you acquire one. If you have a CruzID but haven't yet enrolled your cell phone, please follow the instructions here to enroll your phone: https://its.ucsc.edu/mfa/enroll.html

After confirming your cell phone MFA enrollment, continue to install Tunnelblick. Download Tunnelblick from this link: https://giwiki.gi.ucsc.edu/downloads/Tunnelblick.dmg The username and password to access that web link will be sent to you in your account creation welcome email.

Once you have downloaded the file, navigate in the Finder to wherever you downloaded it and double-click on it. It should open the .dmg file and you should see the Tunnelblick application icon. Double-click the Tunnelblick icon to install. During installation, it will ask whether you want to install for "Only You" or "All Users". Select "Only You". After installation, you may want to navigate to the Applications folder in the Finder and drag the Tunnelblick icon to your Dock for easy launching.

After launching Tunnelblick from the Applications folder, you will see a small "tunnel" icon at the top right of your screen. Click on that icon, then click "Connect prism" to start the VPN. "Prism" is the name of our firewalled environment. Use the username and temporary password that we sent to you in your account creation welcome email to log in to the VPN for the first time. After typing in your username and password, you will be sent a Duo MFA push to your phone. Accept that push, and then you will be connected.

Once you authenticate to the VPN (username/password/MFA), log in via SSH to 'mustard.prism', for example, and you will be asked to change your password. If you are not familiar with SSH, open the "Terminal" application, which can be found in your Applications folder under "Utilities". After launching "Terminal", connect to mustard by typing: ssh username@mustard.prism Where "username" is the username we sent you in the welcome email (incidentally, it is also your CruzID username). It will ask you for a password; type in the temporary password from your account creation welcome email. When you type the password, the characters '''will not''' echo to the screen, so it will not show you what you are typing.
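If you connect to mustard often, an optional client-side alias can save typing. This is a minimal sketch of an entry you could add to ~/.ssh/config on your Mac; the host name 'mustard.prism' comes from these instructions, while the alias 'mustard' and the placeholder username are illustrative, not official:

<pre>
# Optional ~/.ssh/config fragment (illustrative).
# Replace "username" with your own CruzID username from the welcome email.
Host mustard
    HostName mustard.prism
    User username
</pre>

With this in place, typing "ssh mustard" behaves like "ssh username@mustard.prism". Note that it only works while the VPN is connected.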
Once you log in successfully to mustard, it will ask you to change your password. It will ask for your current password one more time, then it will ask you to choose a new password, which you will need to enter twice. Again, whatever password you choose '''will not''' echo to the screen. Your new password must be:

# At least 10 characters long
# Made up of at least 3 character classes (lowercase, uppercase, number and/or special character)
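The two password rules (at least 10 characters, at least 3 of the 4 character classes) can be sketched as a small shell check. This is a hypothetical helper for self-checking a candidate password before you type it, not an official tool, and the server may enforce additional rules beyond these two:

```shell
# Hypothetical helper: checks a candidate password against the two stated
# rules only (length >= 10, at least 3 of 4 character classes).
check_password() {
  pw="$1"
  classes=0
  # Rule 1: at least 10 characters long.
  [ "${#pw}" -ge 10 ] || { echo "too short"; return 1; }
  # Rule 2: count which character classes appear.
  case "$pw" in *[a-z]*) classes=$((classes+1));; esac
  case "$pw" in *[A-Z]*) classes=$((classes+1));; esac
  case "$pw" in *[0-9]*) classes=$((classes+1));; esac
  case "$pw" in *[!a-zA-Z0-9]*) classes=$((classes+1));; esac
  [ "$classes" -ge 3 ] || { echo "needs 3 character classes"; return 1; }
  echo "ok"
}

check_password 'Abcdefgh1!'   # prints "ok": 10 chars, all 4 classes
```

Remember not to leave real passwords in your shell history; this is only meant to illustrate the rules.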
Once you change your password, it will log you out of mustard. Then, log out of the VPN (click the Tunnelblick icon at the top right of your screen and select "Disconnect"). Then, log back into the VPN using your '''new''' password. It will send another Duo MFA push to your phone, and then you should be logged in! Feel free to ssh to any of our firewalled servers (using your new password).

As always, if you have any questions, please email '''cluster-admin@soe.ucsc.edu''' for help.
It will ask for you current password one more time, then it will ask you to choose a new password, which you will need to enter two times. Again, whatever password you choose '''will not''' echo to the screen. You new password must be: 1: At least 10 characters long 2: At least 3 character classes (lowercase, uppercase, number and/or special character) Once you change your password, it will log you out of mustard. Then, log out of the VPN (click the Tunnelblick icon on the top right of your screen and select "disconnect"). Then, log back into the VPN using your '''new''' password. It will send another Duo MFA push to your phone, then you should be logged in! Then feel free to ssh to any of our firewalled servers (using your new password). As always, if you have any questions, please email '''cluster-admin@soe.ucsc.edu''' for help. 0b421cbc50a4eff579bc7cbcaec9f5e3c6a3dfe8 576 575 2025-02-09T16:01:55Z Weiler 3 wikitext text/x-wiki '''Before''' following these instructions, please ensure that you have filled out an account request form and completed all the training and requirements as detailed here: [[Requirement_for_users_to_get_GI_VPN_access]] After completing those requirements, you should have received a welcome email from us explaining that your account is ready. Once you have received that email, continue following these instructions. For MacOS, you will be installing "Tunnelblick", an OpenVPN client software package for Mac. This software is pre-packaged for our systems and already includes built-in configuration. Do not install this software on public or shared computers! Before installing Tunnelblick, you must have enrolled your cell phone for Duo MFA using your CruzID account with UCSC. Most folks already have this from when they first started at UCSC. If you don't yet have a CruzID, please contact your sponsor/PI and ask them to help you acquire a CruzID. 
If you have a CruzID but haven't yet enrolled your cell phone, please follow the instructions here to enroll your phone: https://its.ucsc.edu/mfa/enroll.html After confirming your cell phone MFA enrollment, or if you have already done this a while ago, continue to install Tunnelblick. You will need to download Tunnelblick from this link: https://giwiki.gi.ucsc.edu/downloads/Tunnelblick.dmg The username and password to access that web link will be sent to you in your account creation welcome email. Once you have downloaded the file, navigate in the Finder to wherever you downloaded it to and double click on it. It should open the .dmg file and you should see the Tunnelblick application icon. Double click on the Tunnelblick icon to install. During installation, it will ask you if you want to install for "Only You" or "All Users". Select "Only You". After installation, in your Finder, you may want to navigate to the Applications folder and drag the Tunnelblick icon to your dock for easy launching. After launching Tunnelblick from the Applications folder, you will see a small "tunnel" icon on the top right of your screen, next to the date and WiFi icon. You should be able to click on that icon, then click "Connect prism" to start the VPN. "Prism" is the name of our firewalled environment. Use the username and temporary password that we sent to you in your account creation welcome email to login to the VPN for the first time. After typing in your username and password, you will be sent a Duo MFA push to your phone. Accept that push, and then you will be connected. Once you authenticate to the VPN (username/password/MFA), then login via SSH to 'mustard.prism' for example, and you will be asked to change your password. If you are not familiar with SSH, then you will need to open the "Terminal" application which can be found in your Applications Folder under "Utilities". 
After launching "Terminal" you will connect to mustard by typing: ssh username@mustard.prism Where "username" is the username we sent you in the welcome email (incidentally it is also your CruzID username). It will ask you for a password, just type in the password we sent you in your account creation welcome email. When you type the password, the characters '''will not''' echo to the screen, so it will not show you what you are typing. Once logging in successfully to mustard, it will as you to change your password. It will ask for you current password one more time, then it will ask you to choose a new password, which you will need to enter two times. Again, whatever password you choose '''will not''' echo to the screen. You new password must be: 1: At least 10 characters long 2: At least 3 character classes (lowercase, uppercase, number and/or special character) Once you change your password, it will log you out of mustard. Then, log out of the VPN (click the Tunnelblick icon on the top right of your screen and select "disconnect"). Then, log back into the VPN using your '''new''' password. It will send another Duo MFA push to your phone, then you should be logged in! Then feel free to ssh to any of our firewalled servers (using your new password). As always, if you have any questions, please email '''cluster-admin@soe.ucsc.edu''' for help. 2e256dad405930979788d7c404221c9e2a52e973 600 576 2025-02-09T21:12:52Z Weiler 3 wikitext text/x-wiki '''Before''' following these instructions, please ensure that you have filled out an account request form and completed all the training and requirements as detailed here: [[Requirement_for_users_to_get_GI_VPN_access]] After completing those requirements, you should have received a welcome email from us explaining that your account is ready. Once you have received that email, continue following these instructions. For MacOS, you will be installing "Tunnelblick", an OpenVPN client software package for Mac. 
This software is pre-packaged for our systems and already includes the built-in configuration. Do not install this software on public or shared computers! Before installing Tunnelblick, you must have enrolled your cell phone for Duo MFA using your CruzID account with UCSC. Most folks already have this from when they first started at UCSC. If you don't yet have a CruzID, please contact your sponsor/PI and ask them to help you acquire a CruzID. If you have a CruzID but haven't yet enrolled your cell phone, please follow the instructions here to enroll your phone: https://its.ucsc.edu/mfa/enroll.html After confirming your cell phone MFA enrollment (or if you already did this a while ago), continue to install Tunnelblick. You will need to download Tunnelblick from this link: https://giwiki.gi.ucsc.edu/downloads/Tunnelblick.dmg The username and password to access that web link will be sent to you in your account creation welcome email. Once you have downloaded the file, navigate in the Finder to wherever you downloaded it and double-click it. It should open the .dmg file and you should see the Tunnelblick application icon. Double-click the Tunnelblick icon to install. During installation, it will ask you if you want to install for "Only You" or "All Users". Select "Only You". After installation, in your Finder, you may want to navigate to the Applications folder and drag the Tunnelblick icon to your dock for easy launching. After launching Tunnelblick from the Applications folder, you will see a small "tunnel" icon on the top right of your screen, next to the date and WiFi icon. You should be able to click on that icon, then click "Connect prism" to start the VPN. "Prism" is the name of our firewalled environment. Use the username and temporary password that we sent to you in your account creation welcome email to log in to the VPN for the first time. After typing in your username and password, you will be sent a Duo MFA push to your phone. 
Accept that push, and then you will be connected. Once you authenticate to the VPN (username/password/MFA), log in via SSH to 'mustard.prism', for example, and you will be asked to change your password. If you are not familiar with SSH, you will need to open the "Terminal" application, which can be found in your Applications folder under "Utilities". After launching "Terminal", you will connect to mustard by typing: ssh username@mustard.prism where "username" is the username we sent you in the welcome email (incidentally, it is also your CruzID username). It will ask you for a password; type in the password we sent you in your account creation welcome email. When you type the password, the characters '''will not''' echo to the screen, so it will not show you what you are typing. Once you have logged in to mustard successfully, it will ask you to change your password. It will ask for your current password one more time, then it will ask you to choose a new password, which you will need to enter two times. Again, whatever password you choose '''will not''' echo to the screen. Your new password must have: 1: At least 10 characters 2: At least 3 character classes (lowercase, uppercase, number, and/or special character) Once you change your password, it will log you out of mustard. Then, log out of the VPN (click the Tunnelblick icon on the top right of your screen and select "disconnect"). Then, log back into the VPN using your '''new''' password. It will send another Duo MFA push to your phone, and then you should be logged in! Then feel free to ssh to any of our firewalled servers (using your new password). See the following page for an overview of available resources: https://giwiki.gi.ucsc.edu/index.php?title=Firewalled_Computing_Resources_Overview As always, if you have any questions, please email '''cluster-admin@soe.ucsc.edu''' for help. 
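If you want to sanity-check a candidate password against the policy above before the change-password prompt, a minimal POSIX shell sketch can do it locally. This `check_password` helper is hypothetical (not something installed on our systems); it only encodes the two stated rules: at least 10 characters, and at least 3 of the 4 character classes.

```shell
# Hypothetical local helper: pre-check a candidate password against the
# stated policy (>= 10 characters, >= 3 character classes).
check_password() {
  pw=$1
  classes=0
  case $pw in *[a-z]*) classes=$((classes + 1));; esac   # lowercase
  case $pw in *[A-Z]*) classes=$((classes + 1));; esac   # uppercase
  case $pw in *[0-9]*) classes=$((classes + 1));; esac   # digit
  case $pw in *[!a-zA-Z0-9]*) classes=$((classes + 1));; esac  # special
  if [ ${#pw} -lt 10 ]; then
    echo "too short (need at least 10 characters)"
    return 1
  elif [ $classes -lt 3 ]; then
    echo "need at least 3 character classes"
    return 1
  fi
  echo "ok"
}

check_password 'correcthorsebattery'  # prints "need at least 3 character classes"
check_password 'Correct-Horse-42'     # prints "ok"
```

Note that typing a real password on a command line can leave it in your shell history, so use this only for throwaway candidates, never for the password you actually set.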
de506f991127584b93b68c8d3112bc28b8c5cc14 Setting Up The VPN on Linux 0 62 578 2025-02-09T16:13:15Z Weiler 3 Created page with "'''Before''' following these instructions, please ensure that you have filled out an account request form and completed all the training and requirements as detailed here: [[Requirement_for_users_to_get_GI_VPN_access]] After completing those requirements, you should have received a welcome email from us explaining that your account is ready. Once you have received that email, continue following these instructions. Most Linux flavors support OpenVPN client software...." wikitext text/x-wiki '''Before''' following these instructions, please ensure that you have filled out an account request form and completed all the training and requirements as detailed here: [[Requirement_for_users_to_get_GI_VPN_access]] After completing those requirements, you should have received a welcome email from us explaining that your account is ready. Once you have received that email, continue following these instructions. Most Linux flavors support OpenVPN client software. While the installation process may vary from flavor to flavor, we will be describing the process to get you going for Ubuntu, which should work on most Ubuntu versions and other Ubuntu/Debian derivatives. Do not install this software on public or shared computers! Before installing our VPN profile, you must have enrolled your cell phone for Duo MFA using your CruzID account with UCSC. Most folks already have this from when they first started at UCSC. If you don't yet have a CruzID, please contact your sponsor/PI and ask them to help you acquire a CruzID. If you have a CruzID but haven't yet enrolled your cell phone, please follow the instructions here to enroll your phone: https://its.ucsc.edu/mfa/enroll.html After confirming your cell phone MFA enrollment, or if you have already done this a while ago, continue to install our VPN profile. 
You will need to download our OpenVPN client configuration file from this link: https://giwiki.gi.ucsc.edu/downloads/prism.ovpn The username and password to access that web link will be sent to you in your account creation welcome email. Once you authenticate to the VPN (username/password/MFA), then login via SSH to 'mustard.prism' for example, and you will be asked to change your password. If you are not familiar with SSH, then you will need to open the "Terminal" application which can be found in your Applications Folder under "Utilities". After launching "Terminal" you will connect to mustard by typing: ssh username@mustard.prism Where "username" is the username we sent you in the welcome email (incidentally it is also your CruzID username). It will ask you for a password, just type in the password we sent you in your account creation welcome email. When you type the password, the characters '''will not''' echo to the screen, so it will not show you what you are typing. Once logging in successfully to mustard, it will as you to change your password. It will ask for you current password one more time, then it will ask you to choose a new password, which you will need to enter two times. Again, whatever password you choose '''will not''' echo to the screen. You new password must be: 1: At least 10 characters long 2: At least 3 character classes (lowercase, uppercase, number and/or special character) Once you change your password, it will log you out of mustard. Then, log out of the VPN (click the Tunnelblick icon on the top right of your screen and select "disconnect"). Then, log back into the VPN using your '''new''' password. It will send another Duo MFA push to your phone, then you should be logged in! Then feel free to ssh to any of our firewalled servers (using your new password). As always, if you have any questions, please email '''cluster-admin@soe.ucsc.edu''' for help. 
b25de05be62fd95712f5d16212c31af7cb16ab76 579 578 2025-02-09T16:17:29Z Weiler 3 wikitext text/x-wiki '''Before''' following these instructions, please ensure that you have filled out an account request form and completed all the training and requirements as detailed here: [[Requirement_for_users_to_get_GI_VPN_access]] After completing those requirements, you should have received a welcome email from us explaining that your account is ready. Once you have received that email, continue following these instructions. Most Linux flavors support OpenVPN client software. While the installation process may vary from flavor to flavor, we will be describing the process to get you going for Ubuntu, which should work on most Ubuntu versions and other Ubuntu/Debian derivatives. Do not install this software on public or shared computers! Before installing our VPN profile, you must have enrolled your cell phone for Duo MFA using your CruzID account with UCSC. Most folks already have this from when they first started at UCSC. If you don't yet have a CruzID, please contact your sponsor/PI and ask them to help you acquire a CruzID. If you have a CruzID but haven't yet enrolled your cell phone, please follow the instructions here to enroll your phone: https://its.ucsc.edu/mfa/enroll.html After confirming your cell phone MFA enrollment, or if you have already done this a while ago, continue to install our VPN profile. You will need to download our OpenVPN client configuration file from this link: https://giwiki.gi.ucsc.edu/downloads/prism.ovpn The username and password to access that web link will be sent to you in your account creation welcome email. We will be installing the Prism VPN profile via the Network Manager GUI interface. 
Open '''Network Manager''' from '''Gnome settings''' option and select '''Network''' tab and click on the '''VPN +''' symbol: Once you authenticate to the VPN (username/password/MFA), then login via SSH to 'mustard.prism' for example, and you will be asked to change your password. If you are not familiar with SSH, then you will need to open the "Terminal" application which can be found in your Applications Folder under "Utilities". After launching "Terminal" you will connect to mustard by typing: ssh username@mustard.prism Where "username" is the username we sent you in the welcome email (incidentally it is also your CruzID username). It will ask you for a password, just type in the password we sent you in your account creation welcome email. When you type the password, the characters '''will not''' echo to the screen, so it will not show you what you are typing. Once logging in successfully to mustard, it will as you to change your password. It will ask for you current password one more time, then it will ask you to choose a new password, which you will need to enter two times. Again, whatever password you choose '''will not''' echo to the screen. You new password must be: 1: At least 10 characters long 2: At least 3 character classes (lowercase, uppercase, number and/or special character) Once you change your password, it will log you out of mustard. Then, log out of the VPN (click the Tunnelblick icon on the top right of your screen and select "disconnect"). Then, log back into the VPN using your '''new''' password. It will send another Duo MFA push to your phone, then you should be logged in! Then feel free to ssh to any of our firewalled servers (using your new password). As always, if you have any questions, please email '''cluster-admin@soe.ucsc.edu''' for help. 
040062acc9f122db8fd1eda20e6d6dcabf405ebe 580 579 2025-02-09T16:17:57Z Weiler 3 wikitext text/x-wiki '''Before''' following these instructions, please ensure that you have filled out an account request form and completed all the training and requirements as detailed here: [[Requirement_for_users_to_get_GI_VPN_access]] After completing those requirements, you should have received a welcome email from us explaining that your account is ready. Once you have received that email, continue following these instructions. Most Linux flavors support OpenVPN client software. While the installation process may vary from flavor to flavor, we will be describing the process to get you going for Ubuntu, which should work on most Ubuntu versions and other Ubuntu/Debian derivatives. Do not install this software on public or shared computers! Before installing our VPN profile, you must have enrolled your cell phone for Duo MFA using your CruzID account with UCSC. Most folks already have this from when they first started at UCSC. If you don't yet have a CruzID, please contact your sponsor/PI and ask them to help you acquire a CruzID. If you have a CruzID but haven't yet enrolled your cell phone, please follow the instructions here to enroll your phone: https://its.ucsc.edu/mfa/enroll.html After confirming your cell phone MFA enrollment, or if you have already done this a while ago, continue to install our VPN profile. You will need to download our OpenVPN client configuration file from this link: https://giwiki.gi.ucsc.edu/downloads/prism.ovpn The username and password to access that web link will be sent to you in your account creation welcome email. We will be installing the Prism VPN profile via the Network Manager GUI interface. 
Open '''Network Manager''' from '''Gnome Settings''' option and select the '''Network''' tab and click on the '''VPN +''' symbol: Once you authenticate to the VPN (username/password/MFA), then login via SSH to 'mustard.prism' for example, and you will be asked to change your password. If you are not familiar with SSH, then you will need to open the "Terminal" application which can be found in your Applications Folder under "Utilities". After launching "Terminal" you will connect to mustard by typing: ssh username@mustard.prism Where "username" is the username we sent you in the welcome email (incidentally it is also your CruzID username). It will ask you for a password, just type in the password we sent you in your account creation welcome email. When you type the password, the characters '''will not''' echo to the screen, so it will not show you what you are typing. Once logging in successfully to mustard, it will as you to change your password. It will ask for you current password one more time, then it will ask you to choose a new password, which you will need to enter two times. Again, whatever password you choose '''will not''' echo to the screen. You new password must be: 1: At least 10 characters long 2: At least 3 character classes (lowercase, uppercase, number and/or special character) Once you change your password, it will log you out of mustard. Then, log out of the VPN (click the Tunnelblick icon on the top right of your screen and select "disconnect"). Then, log back into the VPN using your '''new''' password. It will send another Duo MFA push to your phone, then you should be logged in! Then feel free to ssh to any of our firewalled servers (using your new password). As always, if you have any questions, please email '''cluster-admin@soe.ucsc.edu''' for help. 
edc73beee8e6086dc04c0518513617474e3a381d 581 580 2025-02-09T16:19:36Z Weiler 3 wikitext text/x-wiki '''Before''' following these instructions, please ensure that you have filled out an account request form and completed all the training and requirements as detailed here: [[Requirement_for_users_to_get_GI_VPN_access]] After completing those requirements, you should have received a welcome email from us explaining that your account is ready. Once you have received that email, continue following these instructions. Most Linux flavors support OpenVPN client software. While the installation process may vary from flavor to flavor, we will be describing the process to get you going for Ubuntu, which should work on most Ubuntu versions and other Ubuntu/Debian derivatives. Do not install this software on public or shared computers! Before installing our VPN profile, you must have enrolled your cell phone for Duo MFA using your CruzID account with UCSC. Most folks already have this from when they first started at UCSC. If you don't yet have a CruzID, please contact your sponsor/PI and ask them to help you acquire a CruzID. If you have a CruzID but haven't yet enrolled your cell phone, please follow the instructions here to enroll your phone: https://its.ucsc.edu/mfa/enroll.html After confirming your cell phone MFA enrollment, or if you have already done this a while ago, continue to install our VPN profile. You will need to download our OpenVPN client configuration file from this link: https://giwiki.gi.ucsc.edu/downloads/prism.ovpn The username and password to access that web link will be sent to you in your account creation welcome email. We will be installing the Prism VPN profile via the Network Manager GUI interface. 
Open '''Network Manager''' from '''Gnome Settings''' option and select the '''Network''' tab and click on the '''VPN +''' symbol: [[File:Keypairs.png|900px]] Once you authenticate to the VPN (username/password/MFA), then login via SSH to 'mustard.prism' for example, and you will be asked to change your password. If you are not familiar with SSH, then you will need to open the "Terminal" application which can be found in your Applications Folder under "Utilities". After launching "Terminal" you will connect to mustard by typing: ssh username@mustard.prism Where "username" is the username we sent you in the welcome email (incidentally it is also your CruzID username). It will ask you for a password, just type in the password we sent you in your account creation welcome email. When you type the password, the characters '''will not''' echo to the screen, so it will not show you what you are typing. Once logging in successfully to mustard, it will as you to change your password. It will ask for you current password one more time, then it will ask you to choose a new password, which you will need to enter two times. Again, whatever password you choose '''will not''' echo to the screen. You new password must be: 1: At least 10 characters long 2: At least 3 character classes (lowercase, uppercase, number and/or special character) Once you change your password, it will log you out of mustard. Then, log out of the VPN (click the Tunnelblick icon on the top right of your screen and select "disconnect"). Then, log back into the VPN using your '''new''' password. It will send another Duo MFA push to your phone, then you should be logged in! Then feel free to ssh to any of our firewalled servers (using your new password). As always, if you have any questions, please email '''cluster-admin@soe.ucsc.edu''' for help. 
482d80f2fbec2ff43a56528ecc1250eed36092dc 586 581 2025-02-09T16:24:00Z Weiler 3 wikitext text/x-wiki '''Before''' following these instructions, please ensure that you have filled out an account request form and completed all the training and requirements as detailed here: [[Requirement_for_users_to_get_GI_VPN_access]] After completing those requirements, you should have received a welcome email from us explaining that your account is ready. Once you have received that email, continue following these instructions. Most Linux flavors support OpenVPN client software. While the installation process may vary from flavor to flavor, we will be describing the process to get you going for Ubuntu, which should work on most Ubuntu versions and other Ubuntu/Debian derivatives. Do not install this software on public or shared computers! Before installing our VPN profile, you must have enrolled your cell phone for Duo MFA using your CruzID account with UCSC. Most folks already have this from when they first started at UCSC. If you don't yet have a CruzID, please contact your sponsor/PI and ask them to help you acquire a CruzID. If you have a CruzID but haven't yet enrolled your cell phone, please follow the instructions here to enroll your phone: https://its.ucsc.edu/mfa/enroll.html After confirming your cell phone MFA enrollment, or if you have already done this a while ago, continue to install our VPN profile. You will need to download our OpenVPN client configuration file from this link: https://giwiki.gi.ucsc.edu/downloads/prism.ovpn The username and password to access that web link will be sent to you in your account creation welcome email. We will be installing the Prism VPN profile via the Network Manager GUI interface. 
Open '''Network Manager''' from '''Gnome Settings''' option and select the '''Network''' tab and click on the '''VPN +''' symbol: [[File:Configuring_1.png|900px]] Once you authenticate to the VPN (username/password/MFA), then login via SSH to 'mustard.prism' for example, and you will be asked to change your password. If you are not familiar with SSH, then you will need to open the "Terminal" application which can be found in your Applications Folder under "Utilities". After launching "Terminal" you will connect to mustard by typing: ssh username@mustard.prism Where "username" is the username we sent you in the welcome email (incidentally it is also your CruzID username). It will ask you for a password, just type in the password we sent you in your account creation welcome email. When you type the password, the characters '''will not''' echo to the screen, so it will not show you what you are typing. Once logging in successfully to mustard, it will as you to change your password. It will ask for you current password one more time, then it will ask you to choose a new password, which you will need to enter two times. Again, whatever password you choose '''will not''' echo to the screen. You new password must be: 1: At least 10 characters long 2: At least 3 character classes (lowercase, uppercase, number and/or special character) Once you change your password, it will log you out of mustard. Then, log out of the VPN (click the Tunnelblick icon on the top right of your screen and select "disconnect"). Then, log back into the VPN using your '''new''' password. It will send another Duo MFA push to your phone, then you should be logged in! Then feel free to ssh to any of our firewalled servers (using your new password). As always, if you have any questions, please email '''cluster-admin@soe.ucsc.edu''' for help. 
3321a5abb6c24ec1503d782a2e5d6679d617ee17 602 596 2025-02-09T21:13:27Z Weiler 3 wikitext text/x-wiki '''Before''' following these instructions, please ensure that you have filled out an account request form and completed all the training and requirements as detailed here: [[Requirement_for_users_to_get_GI_VPN_access]] After completing those requirements, you should have received a welcome email from us explaining that your account is ready. Once you have received that email, continue following these instructions. Most Linux flavors support OpenVPN client software. While the installation process may vary from flavor to flavor, we describe the process for Ubuntu, which should work on most Ubuntu versions and other Ubuntu/Debian derivatives. Do not install this software on public or shared computers! Before installing our VPN profile, you must have enrolled your cell phone for Duo MFA using your CruzID account with UCSC. Most folks already have this from when they first started at UCSC. If you don't yet have a CruzID, please contact your sponsor/PI and ask them to help you acquire one. If you have a CruzID but haven't yet enrolled your cell phone, please follow the instructions here to enroll your phone: https://its.ucsc.edu/mfa/enroll.html After confirming your cell phone MFA enrollment, continue to install our VPN profile. You will need to download our OpenVPN client configuration file from this link: https://giwiki.gi.ucsc.edu/downloads/prism.ovpn The username and password to access that web link will be sent to you in your account creation welcome email. Save the "prism.ovpn" file locally to your desktop or somewhere else convenient. We will be installing the Prism VPN profile via the Network Manager GUI interface.
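For users comfortable with a terminal, the same profile can be imported without the GUI using nmcli. This is a sketch, not the officially supported path: it assumes the NetworkManager OpenVPN plugin packages are available, and that you saved the file to your desktop; the connection name "prism" is taken from the filename.

 $ sudo apt install network-manager-openvpn network-manager-openvpn-gnome
 $ nmcli connection import type openvpn file ~/Desktop/prism.ovpn
 $ nmcli connection up prism --ask

The imported connection will then also appear in the Network Manager GUI, so the remaining steps below work the same either way.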
Open '''Network Manager''' from the '''Gnome Settings''' menu, select the '''Network''' tab, and click the '''VPN +''' symbol: [[File:Configuring_1.png|600px]] From the '''Add VPN''' window, click on the '''Import from file...''' option: [[File:Configuring_2.png|600px]] Navigate to your .ovpn file (/path/to/your/prism.ovpn) and click the '''Open''' button: [[File:Configuring_3.png|600px]] Click the '''Add''' button: [[File:Configuring_4.png|600px]] Finally, click the '''On/Off''' button to start the VPN: [[File:Configuring_5.png|600px]] Once you authenticate to the VPN (username/password/MFA), log in via SSH to 'mustard.prism', for example, and you will be asked to change your password.
 ssh username@mustard.prism
where "username" is the username we sent you in the welcome email (it is also your CruzID username). It will ask you for a password; type in the password we sent you in your account creation welcome email. When you type the password, the characters '''will not''' echo to the screen, so it will not show you what you are typing. Once you have logged in successfully to mustard, it will ask you to change your password. It will ask for your current password one more time, then it will ask you to choose a new password, which you will need to enter twice. Again, whatever password you choose '''will not''' echo to the screen. Your new password must be: 1: At least 10 characters long 2: At least 3 character classes (lowercase, uppercase, number and/or special character) Once you change your password, it will log you out of mustard. Then, log out of the VPN (toggle the '''On/Off''' button from the Network Manager GUI VPN interface). Then, log back into the VPN using your '''new''' password. It will send another Duo MFA push to your phone, and you should be logged in! Then feel free to ssh to any of our firewalled servers (using your new password).
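The two password rules above can be sanity-checked before you pick a password. A minimal sketch as a shell function (for illustration only; `check_password` is not an installed tool on our servers):

```shell
# Sketch: test a candidate password against the stated policy:
# at least 10 characters, and at least 3 of the 4 character classes.
check_password() {
  pw="$1"
  classes=0
  printf '%s' "$pw" | grep -q '[a-z]' && classes=$((classes+1))
  printf '%s' "$pw" | grep -q '[A-Z]' && classes=$((classes+1))
  printf '%s' "$pw" | grep -q '[0-9]' && classes=$((classes+1))
  printf '%s' "$pw" | grep -q '[^a-zA-Z0-9]' && classes=$((classes+1))
  if [ "${#pw}" -ge 10 ] && [ "$classes" -ge 3 ]; then
    echo ok
  else
    echo weak
  fi
}

check_password 'Summer2025'   # 10 chars, 3 classes; prints: ok
check_password 'short1!'      # only 7 chars; prints: weak
```

Remember that the real check happens server-side when you change your password on mustard; this just saves you a rejected attempt.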
Note the following page for available resources: https://giwiki.gi.ucsc.edu/index.php?title=Firewalled_Computing_Resources_Overview As always, if you have any questions, please email '''cluster-admin@soe.ucsc.edu''' for help. 42bd162dd74df0f89f55ab6730927a5ec1a7327e File:Configuring 1.png 6 63 582 2025-02-09T16:20:45Z Weiler 3 wikitext text/x-wiki da39a3ee5e6b4b0d3255bfef95601890afd80709 Quick Start Instructions to Get Rolling with OpenStack 0 26 583 331 2025-02-09T16:21:05Z Weiler 3 /* Upload your SSH Public Key */ wikitext text/x-wiki __TOC__ ==Request an OpenStack Account== Once you have [http://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access PRISM/GI VPN access], you can request an OpenStack account. Send an email to cluster-admin@soe.ucsc.edu asking for access, and let us know which lab you are in, or who your PI is, so we can place you in the right OpenStack group. ==Create an SSH Public/Private Keypair== To log into an OpenStack VM instance, you will need an SSH public key. The key is "injected" into the instance upon creation, and initially only someone with that key (i.e. you) will be able to log in via SSH. If you already have an SSH public and private key that you use elsewhere, you can use that one and skip to the next step. If you don't have an SSH keypair set up yet, log into the UNIX-compatible machine you will be '''logging in from''' (a Mac/Apple computer will also work) and run the 'ssh-keygen' command. If you are behind the VPN, you can first log into mustard, crimson or razzmatazz, which are Linux servers. The command will look something like this:
 $ ssh-keygen -t rsa
 Generating public/private rsa key pair.
 Enter file in which to save the key (/public/home/frank/.ssh/id_rsa):
 Created directory '/public/home/frank/.ssh'.
 Enter passphrase (empty for no passphrase): [JUST HIT ENTER]
 Enter same passphrase again: [JUST HIT ENTER]
 Your identification has been saved in /public/home/frank/.ssh/id_rsa.
 Your public key has been saved in /public/home/frank/.ssh/id_rsa.pub.
 The key fingerprint is: SHA256:dhJG1A3gcwj7Mz17ommt3NIczMVVgrzp8Tf6F1X4jpI
 The key's randomart image is:
 +---[RSA 2048]----+
 | ..+o.+ ..o.|
 | = .. + o..|
 | . * .. + ..|
 | o = * o|
 | So+o + o.|
 | . =+oE ooo|
 | +o.....o|
 | .o++o . .|
 | .=o. ...|
 +----[SHA256]-----+
You will then have a new directory, "~/.ssh", and inside that directory a file called "id_rsa.pub". That is your SSH public key. You will need it in the next step in order to set up your key in OpenStack. ==Log In To giCloud== Once you have been notified that your account has been set up and have been given login credentials, connect to the VPN and then go to this login page in your favorite web browser: http://gicloud.prism To log in, enter your username and password. You will also see a "Domain" field; just enter the word "default" for the domain. Click "Log In". You will be logged into your group's summary page. ==Upload your SSH Public Key== After creating your new key in the step above, you will need to upload that key into OpenStack. Once you are logged in, on the left-hand navigation menu, click "Project", then in the submenu select "Compute", and finally select "Key Pairs". It should take you to the "Key Pairs" window as shown here. [[File:Configuring_1.png|900px]] Next, click the "Import Public Key" button on the top right of the window. In the resulting window, name your key in the "Key Pair Name" field. Name it something descriptive like "laptop-key" if the key is on your laptop, or "mustard-key" if you are logged into mustard, etc. '''Your key must be an RSA key!''' The newer ED25519 keys '''do not work''' with our version of OpenStack.
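Since only RSA keys are accepted, it is worth confirming what type an existing key is before uploading it. One way, sketched as a small shell function (the function name is made up for illustration), is to look at the first field of the public key line:

```shell
# Sketch: report the key type from the first field of an OpenSSH public key line.
key_type() {
  # $1 is the full contents of an id_*.pub line, e.g. "ssh-rsa AAAA... user@host"
  case "$1" in
    "ssh-rsa "*)     echo "rsa (accepted)" ;;
    "ssh-ed25519 "*) echo "ed25519 (not accepted here)" ;;
    *)               echo "other" ;;
  esac
}

key_type "ssh-rsa AAAAB3Nza user@laptop"   # prints: rsa (accepted)
```

Equivalently, just eyeball the output of "cat ~/.ssh/id_rsa.pub": it must begin with "ssh-rsa".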
To get your key, open a terminal window and type "cat ~/.ssh/id_rsa.pub" to print your full key, like so:
 $ cat ~/.ssh/id_rsa.pub
 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAyVKNfdBbDIk7Iq8JmL+u3vxAn4M1iaQgMU5tHJhMSAYBZEZRLZAFc+Qovxe5zzs1ixte9lCipLep39q2I4U8XND17nYliZ4HVM4MW4GsMUfKsgX2FI3mB2vAQ9pZSLkAhTg2D+92uALUSSv1cDZhTqo7DuPRX2Upxyd5QbRL6TRFswBjHz2vY/JpaPQm1S1d10mokPpmxehLfwp0mVgmz1Uv/6FflqiZ68DhDN67cs1yQgWYXQ01IHPjzTKRwCuZVkgT99rkoqy6TkAyrvsfzYPZbIA2y+ovOBzq6WCUT9gp5Jx/UE6CxLSmAuGPAJkV5D/twKIe75xc+5jdi3I1cgKw== user@laptop
Copy that whole line, starting with "ssh-rsa" all the way through the very last character, including the "user@laptop" bit (which may be different for you; just be sure to include it). Then, back in the OpenStack Key Pair dialogue window, paste the key into the "Public Key" field and click "Import Key". The key should then appear in the key list. ==Launch a New Instance== We are now ready to launch a new VM instance. On the left navigation menu, select "Project", then in the submenu select "Compute", and finally select "Instances". You will see any currently running instances in your group in the resulting screen. Next, click the "Launch Instance" button on the top right. You will be put into the "Details" tab of the instance creation dialogue. Choose an instance name and enter it into the "Instance Name" field. It should include your username as a prefix so that others know who owns each instance; something like "frank-newtest1" would work well. You can ignore the "Description" field; "Availability Zone" should be "nova" and "Count" should be "1". Next, click the "Source" tab on the left. In the "Source" menu, in the "Select Boot Source" field, select "Image", and next to it select "No" for "Create New Volume". Then, in the list of images below, choose your image and click the little "Up Arrow" icon to its right to add it. Next, click the "Flavor" tab on the left.
In that menu, choose how much CPU, RAM and disk space you want for your new VM. Some images have minimum requirements, so some of the smaller flavors may not be available. Select your flavor by clicking the little "Up Arrow" icon on the right of your flavor. Next, click the "Key Pair" tab on the left. Click the little "Up Arrow" to the right of the Key Pair you created in the previous step. You can ignore the rest of the options on the left; you have configured all you need to launch the instance. Click the blue "Launch Instance" button on the bottom right of your window, as seen below: [[File:Launch.png|850px]] You will be taken back to the Instances Summary page and you should see your new instance launching. After a bit, your instance will change from "Spawning" to "Running". This means the instance is now booting, and it should finish booting in a minute or two. In the meantime, we need to attach a "Floating IP" address to your instance so that you can SSH into it. On the right side of your running instance, you should see a drop-down menu; usually the "Create Snapshot" option is pre-selected. Click the drop-down menu arrow to open the menu, and select "Associate Floating IP". In the "Associate Floating IP" dialogue, click the drop-down menu to see if any IP addresses are already available, and if so, go ahead and select one. If there are none available, click the little "+" button to the right to allocate a floating IP address. It will ask you what pool to use; select "ext-net". You can put in a description if you want, but most folks leave that field blank. Then click "Allocate IP". It will take you back one menu level. It will have a field "Port to be Associated"; leave the default that is already there. Click the blue "Associate" button on the bottom right of the window. You will be returned to the "Instances Summary" page again.
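If you happen to have the OpenStack command-line client installed and configured with your credentials, the same floating IP steps can also be done from a terminal. This is only a sketch: the instance name and pool follow the examples above, and the web interface remains the recommended path.

 $ openstack floating ip create ext-net
 $ openstack server add floating ip frank-newtest1 10.50.100.67

Either way, the result is the same: the instance shows a Floating IP in the Instances Summary page.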
You will see your instance running, and it should now list the "Floating IP" it is running under. That is the IP you will use to SSH to the instance.

==Connect to Your New Instance==

Now that your instance is up and running, let's SSH to it and get going! '''From the computer you created your SSH keys on,''' SSH to your instance using the username matching the OS type you chose (ubuntu, centos, etc.) and the Floating IP address of your instance. '''You must be connected to the VPN for this to work!''' Example:

 $ ssh ubuntu@10.50.100.67

If you launched a CentOS instance, it would instead be "ssh centos@10.50.100.67", as appropriate. Assuming everything went as planned, you will be logged into your new Linux instance as the "ubuntu" or "centos" user, which is an unprivileged user; you do, however, have full sudo rights to do whatever administration you need. If you get a "Connection Refused" error when trying to SSH in, your instance isn't quite through launching yet; try again in about 30 seconds.

At this point it is assumed you have a few systems administration skills under your belt, or at least some time to ask Google how to perform various Linux tasks as necessary. Your instance has full access to the Greater Internet, so you can download things from the Internet, run "apt-get install" or "yum update" or whatever is appropriate, and install any software you need to get your work done.

'''NOTE:''' You are the Systems Administrator of your instance - we cannot support questions on how to administer Linux for you. If OpenStack itself is having issues then please let us know, but please defer questions like "How do I install software on Ubuntu" to Google searches.
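Since a freshly launched instance can refuse SSH connections for a short while, as noted above, a small retry loop saves re-typing the command by hand. A minimal POSIX-shell sketch; the IP in the usage comment is the example address from above:

```shell
# Generic retry helper: run a command until it succeeds, up to N attempts,
# sleeping briefly between tries. Returns non-zero if all attempts fail.
retry() {
    attempts=$1; shift
    i=0
    until "$@"; do
        i=$((i + 1))
        [ "$i" -ge "$attempts" ] && return 1
        sleep 1
    done
}

# Example usage (hypothetical IP): keep trying until the instance boots.
# retry 10 ssh -o ConnectTimeout=5 ubuntu@10.50.100.67 true
```

A longer sleep between tries (e.g. 30 seconds, matching the advice above) is more polite for a real instance.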
==Storage on Your New Instance==

Most of the storage on your new instance is located under the /mnt directory, as seen in the output of a "df -h" command on the instance:

 ubuntu@erich1:~$ df -h
 Filesystem      Size  Used Avail Use% Mounted on
 udev             16G     0   16G   0% /dev
 tmpfs           3.2G  676K  3.2G   1% /run
 /dev/vda1        20G  975M   19G   5% /
 tmpfs            16G     0   16G   0% /dev/shm
 tmpfs           5.0M     0  5.0M   0% /run/lock
 tmpfs            16G     0   16G   0% /sys/fs/cgroup
 /dev/vda15      105M  3.4M  102M   4% /boot/efi
 /dev/vdb1       1.0T  1.1G 1023G   1% /mnt
 tmpfs           3.2G     0  3.2G   0% /run/user/1000

Notice that "/mnt" has 1TB of disk space, so store all your big important data in /mnt. Avoid storing data on "/" whenever possible, to prevent the root filesystem from filling up. The exact amount of storage available depends on the flavor you chose when creating the instance.

==Instance Control Options==

Just a few notes on controlling your instances. They are fully functioning Linux machines, so a "sudo reboot" will reboot a machine, "sudo poweroff" will shut it down, etc. In cloud parlance, "Shut Down" means the instance is still there but the power is off; "Terminated" means it is fully deleted and unrecoverable, so be sure you really want to delete your instance before you do so. We do not back instances up. We also have no access to your instance, so we cannot log in and see what's going on.

You can control your instance in several ways from the OpenStack web interface, on the Instance Summary page. On the right side of your instance in the list is that little drop-down menu. Options of interest are:

'''1: Create Snapshot''' - Never use this option, as we have not implemented snapshotting in this environment.

'''2: View Log''' - This shows the boot/console log of the instance, so you can check whether anything is causing issues.

'''3: Hard Reboot Instance''' - This hard reboots your instance, much like pressing the power button to power the instance off; it will power back on moments later.
This is useful if your instance is hosed because of a software crash or something else that has wedged it.

'''4: Delete Instance''' - This permanently destroys your instance. It is deleted and unrecoverable. It also frees up the resources it was using so that others can use them. This is useful if the group quotas have been reached and some old instances need to be cleaned out to make room for new ones.

'''5: Start Instance''' - This option is available when the instance is in the "Shut Down" state; invoking it boots the instance back up.

Do not use the other options you may see there; most have not been implemented in our deployment of OpenStack.

==Changing Your OpenStack Web Interface Password==

Once you have logged in to the web interface, you can change your password as follows. On the top right of the OpenStack web interface you should see a little icon with your username on it. Click that icon to expand the drop-down menu and select "Settings". In the next window, on the left navigation bar, you should see the "Change Password" button. Complete the Change Password dialogue to change your password. You may have to log in again after changing your password.

==Networking==

Your instances are connected at 10Gb/s to each other and the Internet. Of course, actual transfer speeds will likely vary based on disk speed, the speed of the site you are transferring data to or from, and other factors. Your instance is located in a private network that can only be seen by other instances in your group. Other OpenStack groups are logically separated into their own networks, and your instance cannot route to them. Also, no one can access your instance unless they have a VPN account with us, so your instances are completely fenced off from inbound Greater Internet traffic, which means you are largely secure against script kiddies and hackers. You are able to connect outbound from your instances.
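Since transfer speed also depends on available disk, it can help to check free space at the destination before starting a large copy. A minimal sketch; the path here is an example, and on an instance you would typically point it at /mnt:

```shell
# Print available space at a destination path before a big transfer.
# "/" is used here only as an example; on an instance, check /mnt instead.
target="/"
avail_kb=$(df -P "$target" | awk 'NR==2 {print $4}')
echo "available on $target: ${avail_kb} KB"
```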
==Etiquette==

There is one main thing to remember when using instances in OpenStack: when you create an instance, it uses CPU and RAM and, most importantly, it pins disk space for that instance. If you use up all the disk, CPU and RAM quota for your group, others have no resources left to create their own instances. The best plan of action is to fire up your VM, keep it up while you need it, then copy your data off it and delete the instance. Document the steps taken to create your instance so that you could do it again if you needed to. If the physical node your instance resides on blows up, your instance is lost forever and we have no backups, so it is up to you to back up important data.

It's also not good form to spin up an instance, store data there, and then not log in for months at a time; that pins resources that others may need for urgent work. Try to be a good neighbor!
==Etiquette== There is one main thing to remember when using instances in OpenStack. When you create an instance, it uses CPU, RAM and, most importantly, it pins disk space for that instance. If you use up all the disk, CPU and RAM quota for your group, then others have no resources left to create their own instances. The best plan of action is to fire up your VM when you need it, then copy your data off it and delete the instance when you are done. Document the steps taken to create your instance so that you could do it again if you needed to. If the physical node that your instance resides on blows up, then your instance is lost forever and we have no backups, so it is up to you to back up important data. It is also not good form to spin up an instance, store data there, and then not log in for months at a time; then you are pinning resources that others may need for urgent work. Try to be a good neighbor! 2479bff28d9ffcd56d1c9d26d22a0966edf5f15e File:Configuring 2.png 6 64 590 2025-02-09T16:27:45Z Weiler 3 wikitext text/x-wiki da39a3ee5e6b4b0d3255bfef95601890afd80709 File:Configuring 3.png 6 65 592 2025-02-09T16:31:41Z Weiler 3 wikitext text/x-wiki da39a3ee5e6b4b0d3255bfef95601890afd80709 File:Configuring 4.png 6 66 593 2025-02-09T16:31:53Z Weiler 3 wikitext text/x-wiki da39a3ee5e6b4b0d3255bfef95601890afd80709 File:Configuring 5.png 6 67 594 2025-02-09T16:32:07Z Weiler 3 wikitext text/x-wiki da39a3ee5e6b4b0d3255bfef95601890afd80709 Setting Up The VPN on Windows 0 68 598 2025-02-09T17:28:43Z Weiler 3 Created page with "'''Before''' following these instructions, please ensure that you have filled out an account request form and completed all the training and requirements as detailed here: [[Requirement_for_users_to_get_GI_VPN_access]] After completing those requirements, you should have received a welcome email from us explaining that your account is ready. Once you have received that email, continue following these instructions.
We will be installing OpenVPN Connect client for Wind..." wikitext text/x-wiki 601 599 2025-02-09T21:13:12Z Weiler 3 wikitext text/x-wiki '''Before''' following these instructions, please ensure that you have filled out an account request form and completed all the training and requirements as detailed here: [[Requirement_for_users_to_get_GI_VPN_access]] After completing those requirements, you should have received a welcome email from us explaining that your account is ready. Once you have received that email, continue following these instructions. We will be installing OpenVPN Connect client for Windows. This VPN client currently supports '''Windows 10''' and '''Windows 11'''.
Do not install this software on public or shared computers! Before installing our VPN profile, you must have enrolled your cell phone for Duo MFA using your CruzID account with UCSC. Most folks already have this from when they first started at UCSC. If you don't yet have a CruzID, please contact your sponsor/PI and ask them to help you acquire a CruzID. If you have a CruzID but haven't yet enrolled your cell phone, please follow the instructions here to enroll your phone: https://its.ucsc.edu/mfa/enroll.html After confirming your cell phone MFA enrollment, or if you have already done this a while ago, continue to install our VPN profile. You will need to download our OpenVPN client configuration file from this link: https://giwiki.gi.ucsc.edu/downloads/prism.ovpn The username and password to access that web link will be sent to you in your account creation welcome email. Save the "prism.ovpn" file locally to your desktop or somewhere else convenient. After downloading the OpenVPN configuration file, you will need to download the OpenVPN Connect client from here: https://openvpn.net/community-downloads/ Download the '''Windows 64-bit MSI installer'''. Double click on the Installer to begin installation, and follow the on-screen prompts to complete installation. Once the installation is complete, launch the '''OpenVPN Connect''' application. Review and agree to the '''Data Usage Policy'''. After opening the app, we will need to import the '''prism.ovpn''' file you downloaded earlier. Click '''File''' and browse to the location of your '''prism.ovpn''' file. Import the file. You should now be able to select the new profile and click connect. It should ask you for a username and password, which we will have sent you in our welcome email. After entering the username and password, you will receive a Duo Push on your phone in order to complete authentication. The OpenVPN status icon will appear in your system tray on the bottom right of your desktop. 
If it is yellow or red, that indicates you are not connected. If it is green, that indicates you are connected. Once you authenticate to the VPN (username/password/MFA), log in via SSH to 'mustard.prism', for example, and you will be asked to change your password. You can access SSH via several common Windows SSH clients, such as PuTTY. You can also use SSH right from a Command Prompt or PowerShell window (do a Windows search for "command" or "powershell" to find them). Once you have launched one of those applications, do:

 ssh username@mustard.prism

Where "username" is the username we sent you in the welcome email (incidentally, it is also your CruzID username). It will ask you for a password; just type in the password we sent you in your account creation welcome email. When you type the password, the characters '''will not''' echo to the screen, so it will not show you what you are typing. Once you have logged in successfully to mustard, it will ask you to change your password. It will ask for your current password one more time, then it will ask you to choose a new password, which you will need to enter two times. Again, whatever password you choose '''will not''' echo to the screen. Your new password must be:

1: At least 10 characters long

2: Contain at least 3 character classes (lowercase, uppercase, number and/or special character)

Once you change your password, it will log you out of mustard. Then, log out of the VPN (select "Disconnect" from the '''OpenVPN Connect''' application). Then, log back into the VPN using your '''new''' password. It will send another Duo MFA push to your phone, and then you should be logged in! Then feel free to ssh to any of our firewalled servers (using your new password). Note the following page for available resources: https://giwiki.gi.ucsc.edu/index.php?title=Firewalled_Computing_Resources_Overview As always, if you have any questions, please email '''cluster-admin@soe.ucsc.edu''' for help.
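A small convenience once your account works: an entry in your SSH client's configuration file (for OpenSSH this is ~/.ssh/config, or C:\Users\you\.ssh\config on Windows) saves typing the full host name. A sketch, where "alice" is a placeholder for your CruzID username:

```
Host mustard
    HostName mustard.prism
    User alice
```

With this in place, running `ssh mustard` (while connected to the VPN) is equivalent to `ssh alice@mustard.prism`.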
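The password rules above (at least 10 characters, at least 3 character classes) can be checked before you sit through the change-password prompt. A sketch in POSIX shell; check_password is our own illustration, not an official GI tool:

```shell
# Succeed (exit 0) only if the candidate password meets the stated policy:
# >= 10 characters and >= 3 of: lowercase, uppercase, digit, special character.
check_password() {
    pw=$1
    classes=0
    case $pw in *[a-z]*) classes=$((classes + 1));; esac
    case $pw in *[A-Z]*) classes=$((classes + 1));; esac
    case $pw in *[0-9]*) classes=$((classes + 1));; esac
    case $pw in *[!a-zA-Z0-9]*) classes=$((classes + 1));; esac
    [ ${#pw} -ge 10 ] && [ "$classes" -ge 3 ]
}

check_password 'C@ndyIsFun' && echo "acceptable"   # prints "acceptable"
```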
015e996cca204c3a6f93948f9cae3f92627b657b Multi Factor Authentication (MFA) Frequently Asked Questions 0 69 604 2025-02-10T18:15:54Z Weiler 3 Created page with "__TOC__ == Why Do We Need MFA To Login To The VPN? == We need to comply with '''NIST 800-171''' Security Standards in order to store data downloaded from NIH, according to new regulations. '''NIST 800-171''' controls require we enable MFA for VPN logins to harden our security posture. MFA is a good idea, security-wise, anyway though! Even though it can be somewhat annoying. == What Kind of MFA System Are We Using here at the GI? == We are using '''Duo Mobile''' as..." wikitext text/x-wiki 631 630 2025-02-11T22:34:16Z Weiler 3 wikitext text/x-wiki __TOC__ == Why Do We Need MFA To Login To The VPN? == We need to comply with '''NIST 800-171''' Security Standards in order to store data downloaded from NIH, according to new regulations. '''NIST 800-171''' controls require we enable MFA for VPN logins to harden our security posture. MFA is a good idea, security-wise, anyway though! Even though it can be somewhat annoying. == What Kind of MFA System Are We Using here at the GI? == We are using '''Duo Mobile''' as our MFA authentication mechanism.
You probably are already using it to authenticate to CruzID related systems. == Do I need a CruzID before I can use Duo Mobile at the Genomics Institute? == Yes, you do. If you do not yet have a CruzID, please ask your sponsor or PI to get you a CruzID set up. You will need this active before you can authenticate to the Genomics Institute VPN. == Duo is Working To Login to the GI VPN, But It Is Calling My Phone Instead of Sending Me a Push! What Can I Do? == If you previously set up Duo to send you a text with a code, or to call you to authenticate, '''and you would prefer to just receive a Push Notification instead''', you can do it by logging in here: https://cruzid.ucsc.edu/idmuser_login Use your CruzID Gold username and password. You may get a call or text with MFA stuff in it as usual, but don't act on that yet. During the Duo notice that pops up in your web browser that says "Verify your identity..." there will be a small link below that which says '''Other Options'''. Click that, and from there you should be able to change the way in which Duo MFA authenticates you, enroll a new device (like a phone) by selecting '''Manage Devices''', etc. == All This Documentation References "Duo Push", but I use Duo by Another Method...? == Most of our users use Duo in the context of getting a "Push", i.e. when you enter your username and password, you get a Push Request on your phone and click the green "Accept" button to finish authentication. But there are a few cases where that is not possible. If the Push option of Duo authentication is not possible, you can utilize the "Rolling Code" option or the "Yubikey" option. For the "Rolling Code" option, you must have already enrolled Duo on your phone. Then open the Duo App on your phone and click the "UC Santa Cruz" option. It should show you a six digit passcode. 
Then, when you authenticate to the Genomics Institute VPN, type your username in the "Username" field on your VPN client, then in the password field, type your password, then a comma, then the six digit code you see in the Duo App. '''Rolling Code Option''' For example, if my credentials were: username: bob password: C@ndyIsFun and I looked on my phone and saw my six digit code in the Duo App as "643726", I would enter these credentials: username: bob password: C@ndyIsFun,643726 And that would authenticate me. '''Yubikey Option''' It's the same idea with a Yubikey. In the "Password" field, just type in your password followed by a comma, followed by the code on your Yubikey. 1ef90471e76807eb378f45448dd099390a5a59f0 632 631 2025-02-11T22:35:59Z Weiler 3 /* All This Documentation References "Duo Push", but I use Duo by Another Method...? */ wikitext text/x-wiki __TOC__ == Why Do We Need MFA To Login To The VPN? == We need to comply with '''NIST 800-171''' Security Standards in order to store data downloaded from NIH, according to new regulations. '''NIST 800-171''' controls require that we enable MFA for VPN logins to harden our security posture. MFA is a good idea security-wise anyway, even though it can be somewhat annoying. == What Kind of MFA System Are We Using here at the GI? == We are using '''Duo Mobile''' as our MFA authentication mechanism. You probably are already using it to authenticate to CruzID related systems. == Do I need a CruzID before I can use Duo Mobile at the Genomics Institute? == Yes, you do. If you do not yet have a CruzID, please ask your sponsor or PI to get you a CruzID set up. You will need this active before you can authenticate to the Genomics Institute VPN. == Duo is Working To Login to the GI VPN, But It Is Calling My Phone Instead of Sending Me a Push! What Can I Do?
== If you previously set up Duo to send you a text with a code, or to call you to authenticate, '''and you would prefer to just receive a Push Notification instead''', you can do it by logging in here: https://cruzid.ucsc.edu/idmuser_login Use your CruzID Gold username and password. You may get a call or text with MFA stuff in it as usual, but don't act on that yet. During the Duo notice that pops up in your web browser that says "Verify your identity..." there will be a small link below that which says '''Other Options'''. Click that, and from there you should be able to change the way in which Duo MFA authenticates you, enroll a new device (like a phone) by selecting '''Manage Devices''', etc. == All This Documentation References "Duo Push", but I use Duo by Another Method...? == Most of our users use Duo in the context of getting a "Push", i.e. when you enter your username and password, you get a Push Request on your phone and click the green "Accept" button to finish authentication. But there are a few cases where that is not possible. If the Push option of Duo authentication is not possible, you can utilize the "Rolling Code" option or the "Yubikey" option. For the "Rolling Code" option, you must have already enrolled Duo on your phone. Then open the Duo App on your phone and click the "UC Santa Cruz" option. It should show you a six digit passcode. Then, when you authenticate to the Genomics Institute VPN, type your username in the "Username" field on your VPN client, then in the password field, type your password, then a comma, then the six digit code you see in the Duo App. '''Rolling Code Option''' For example, if my credentials were: username: bob password: C@ndyIsFun and I looked on my phone and saw my six digit code in the Duo App as "643726", I would enter these credentials: username: bob password: C@ndyIsFun,643726 And that would authenticate me. '''Yubikey Option''' It's the same idea with a Yubikey. 
In the "Password" field, just type in your password followed by a comma, followed by the code on your Yubikey Application. 53d738d6aedcb63127ffe8d03cf80961696c8d36 Setting Up The VPN on MacOS 0 61 605 600 2025-02-10T21:20:42Z Weiler 3 wikitext text/x-wiki '''Before''' following these instructions, please ensure that you have filled out an account request form and completed all the training and requirements as detailed here: [[Requirement_for_users_to_get_GI_VPN_access]] After completing those requirements, you should have received a welcome email from us explaining that your account is ready. Once you have received that email, continue following these instructions. For MacOS, you will be installing "Tunnelblick", an OpenVPN client software package for Mac. Do not install this software on public or shared computers! Before installing Tunnelblick, you must have enrolled your cell phone for Duo MFA using your CruzID account with UCSC. Most folks already have this from when they first started at UCSC. If you don't yet have a CruzID, please contact your sponsor/PI and ask them to help you acquire a CruzID. If you have a CruzID but haven't yet enrolled your cell phone, please follow the instructions here to enroll your phone: https://its.ucsc.edu/mfa/enroll.html After confirming your cell phone MFA enrollment, or if you have already done this a while ago, continue to install Tunnelblick. Download the OpenVPN configuration file we will be using. The username and password to access this web link should have been sent to you in your account creation welcome email: https://giwiki.gi.ucsc.edu/downloads/prism-duo.ovpn And save that file to your Desktop. Next, you will need to download Tunnelblick (the latest Stable Version) from this link: https://tunnelblick.net/downloads.html Once you have downloaded Tunnelblick, double-click on it and proceed through the installation steps. During installation, it will ask you if you want to install for "Only You" or "All Users". 
Select "Only You". At the end it will ask if you have any configuration files, say "Yes" and select the '''prism-duo.ovpn''' file you downloaded earlier. After installation, in your Finder, you may want to navigate to the Applications folder and drag the Tunnelblick icon to your dock for easy launching. After launching Tunnelblick from the Applications folder, you will see a small "tunnel" icon on the top right of your screen, next to the date and WiFi icon. You should be able to click on that icon, then click "Connect prism-duo" to start the VPN. "Prism" is the name of our firewalled environment. Use the username and temporary password that we sent to you in your account creation welcome email to log in to the VPN for the first time. After typing in your username and password, you will be sent a Duo MFA push to your phone. Accept that push, and then you will be connected. Once you authenticate to the VPN (username/password/MFA), then log in via SSH to 'mustard.prism' for example, and you will be asked to change your password. If you are not familiar with SSH, then you will need to open the "Terminal" application which can be found in your Applications Folder under "Utilities". After launching "Terminal" you will connect to mustard by typing: ssh username@mustard.prism Where "username" is the username we sent you in the welcome email (incidentally it is also your CruzID username). It will ask you for a password; just type in the password we sent you in your account creation welcome email. When you type the password, the characters '''will not''' echo to the screen, so it will not show you what you are typing. Once you have logged in successfully to mustard, it will ask you to change your password. It will ask for your current password one more time, then it will ask you to choose a new password, which you will need to enter two times. Again, whatever password you choose '''will not''' echo to the screen.
Your new password must be: 1: At least 10 characters long 2: At least 3 character classes (lowercase, uppercase, number and/or special character) Once you change your password, it will log you out of mustard. Then, log out of the VPN (click the Tunnelblick icon on the top right of your screen and select "disconnect"). Then, log back into the VPN using your '''new''' password. It will send another Duo MFA push to your phone, then you should be logged in! Then feel free to ssh to any of our firewalled servers (using your new password). Note the following page for available resources: https://giwiki.gi.ucsc.edu/index.php?title=Firewalled_Computing_Resources_Overview As always, if you have any questions, please email '''cluster-admin@soe.ucsc.edu''' for help. d0183b8a43607bec5b67bd958b147ff5247b22e2 623 605 2025-02-11T00:17:28Z Weiler 3 wikitext text/x-wiki '''Before''' following these instructions, please ensure that you have filled out an account request form and completed all the training and requirements as detailed here: [[Requirement_for_users_to_get_GI_VPN_access]] After completing those requirements, you should have received a welcome email from us explaining that your account is ready. Once you have received that email, continue following these instructions. For MacOS, you will be installing "Tunnelblick", an OpenVPN client software package for Mac. Do not install this software on public or shared computers! Before installing Tunnelblick, you must have enrolled your cell phone for Duo MFA using your CruzID account with UCSC. Most folks already have this from when they first started at UCSC. If you don't yet have a CruzID, please contact your sponsor/PI and ask them to help you acquire a CruzID.
If you have a CruzID but haven't yet enrolled your cell phone, please follow the instructions here to enroll your phone: https://its.ucsc.edu/mfa/enroll.html After confirming your cell phone MFA enrollment, or if you have already done this a while ago, continue to install Tunnelblick. Download the OpenVPN configuration file we will be using. The username and password to access this web link should have been sent to you in your account creation welcome email: https://giwiki.gi.ucsc.edu/downloads/prism-duo.ovpn And save that file to your Desktop. Next, you will need to download Tunnelblick (the latest Stable Version) from this link: https://tunnelblick.net/downloads.html Once you have downloaded Tunnelblick, double-click on it and proceed through the installation steps. At the end it will ask if you have any configuration files, say "Yes" and click "OK". After installation, in your Finder, you may want to navigate to the Applications folder and drag the Tunnelblick icon to your dock for easy launching. After launching Tunnelblick from the Applications folder, you will see a small "tunnel" icon on the top right of your screen, next to the date and WiFi icon. Drag the configuration file ('''prism-duo.ovpn''') on your Desktop to the little Tunnelblick icon on the top right of your screen to install the new profile. It will ask you to type in your laptop password (do that). It will also ask you if you want to install for "Only You" or "All Users". Select "Only You". You should be able to click on that icon again in the top right, then click "Connect prism-duo" from the resulting menu to start the VPN. "Prism" is the name of our firewalled environment. Use the username and temporary password that we sent to you in your account creation welcome email to login to the VPN for the first time. After typing in your username and password, you will be sent a Duo MFA push to your phone. Accept that push, and then you will be connected. 
Once you authenticate to the VPN (username/password/MFA), then log in via SSH to 'mustard.prism' for example, and you will be asked to change your password. If you are not familiar with SSH, then you will need to open the "Terminal" application which can be found in your Applications Folder under "Utilities". After launching "Terminal" you will connect to mustard by typing: ssh username@mustard.prism Where "username" is the username we sent you in the welcome email (incidentally it is also your CruzID username). It will ask you for a password; just type in the password we sent you in your account creation welcome email. When you type the password, the characters '''will not''' echo to the screen, so it will not show you what you are typing. Once you have logged in successfully to mustard, it will ask you to change your password. It will ask for your current password one more time, then it will ask you to choose a new password, which you will need to enter two times. Again, whatever password you choose '''will not''' echo to the screen. Your new password must be: 1: At least 10 characters long 2: At least 3 character classes (lowercase, uppercase, number and/or special character) Once you change your password, it will log you out of mustard. '''Then, log out of the VPN''' (click the Tunnelblick icon on the top right of your screen and select "disconnect"). This step is very important. Then, log back into the VPN using your '''new''' password. It will send another Duo MFA push to your phone, then you should be logged in! Then feel free to ssh to any of our firewalled servers (using your new password). Note the following page for available resources: https://giwiki.gi.ucsc.edu/index.php?title=Firewalled_Computing_Resources_Overview As always, if you have any questions, please email '''cluster-admin@soe.ucsc.edu''' for help.
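The password rules above (at least 10 characters, at least 3 character classes) can be checked locally before you commit to a new password. This is an illustrative POSIX shell sketch only; the helper name is made up, and the server-side policy is what actually decides:

```shell
# Hypothetical local check of the GI password rules, not an official tool:
# at least 10 characters, and at least 3 of the 4 character classes
# (lowercase, uppercase, number, special character).
check_gi_password() {
  pw="$1"
  classes=0
  [ "${#pw}" -ge 10 ] || { echo "too short"; return 1; }
  case "$pw" in *[a-z]*) classes=$((classes+1));; esac
  case "$pw" in *[A-Z]*) classes=$((classes+1));; esac
  case "$pw" in *[0-9]*) classes=$((classes+1));; esac
  case "$pw" in *[!a-zA-Z0-9]*) classes=$((classes+1));; esac
  [ "$classes" -ge 3 ] || { echo "need 3 character classes"; return 1; }
  echo "ok"
}
```

For example, the wiki's sample password passes: `check_gi_password 'C@ndyIsFun'` prints "ok" (10 characters; lowercase, uppercase, and a special character).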
c4be5384b04c0227701f2651303cac3ec323bc5c 640 623 2025-02-13T19:06:00Z Weiler 3 wikitext text/x-wiki '''Before''' following these instructions, please ensure that you have filled out an account request form and completed all the training and requirements as detailed here: [[Requirement_for_users_to_get_GI_VPN_access]] After completing those requirements, you should have received a welcome email from us explaining that your account is ready. Once you have received that email, continue following these instructions. For MacOS, you will be installing "Tunnelblick", an OpenVPN client software package for Mac. Do not install this software on public or shared computers! Before installing Tunnelblick, you must have enrolled your cell phone for Duo MFA using your CruzID account with UCSC. Most folks already have this from when they first started at UCSC. If you don't yet have a CruzID, please contact your sponsor/PI and ask them to help you acquire a CruzID. If you have a CruzID but haven't yet enrolled your cell phone, please follow the instructions here to enroll your phone: https://its.ucsc.edu/mfa/enroll.html After confirming your cell phone MFA enrollment, or if you have already done this a while ago, continue to install Tunnelblick. Download the OpenVPN configuration file we will be using. The username and password to access this web link should have been sent to you in your account creation welcome email: https://giwiki.gi.ucsc.edu/downloads/prism-duo.ovpn Download that file to your Desktop by right-clicking on the link above and selecting "Save Link As...", and save it to your Desktop or some other area you will remember. Next, you will need to download Tunnelblick (the latest Stable Version) from this link: https://tunnelblick.net/downloads.html Once you have downloaded Tunnelblick, double-click on it and proceed through the installation steps. At the end it will ask if you have any configuration files, say "Yes" and click "OK". 
After installation, in your Finder, you may want to navigate to the Applications folder and drag the Tunnelblick icon to your dock for easy launching. After launching Tunnelblick from the Applications folder, you will see a small "tunnel" icon on the top right of your screen, next to the date and WiFi icon. Drag the configuration file ('''prism-duo.ovpn''') on your Desktop to the little Tunnelblick icon on the top right of your screen to install the new profile. It will ask you to type in your laptop password (do that). It will also ask you if you want to install for "Only You" or "All Users". Select "Only You". You should be able to click on that icon again in the top right, then click "Connect prism-duo" from the resulting menu to start the VPN. "Prism" is the name of our firewalled environment. Use the username and temporary password that we sent to you in your account creation welcome email to log in to the VPN for the first time. After typing in your username and password, you will be sent a Duo MFA push to your phone. Accept that push, and then you will be connected. Once you authenticate to the VPN (username/password/MFA), then log in via SSH to 'mustard.prism' for example, and you will be asked to change your password. If you are not familiar with SSH, then you will need to open the "Terminal" application which can be found in your Applications Folder under "Utilities". After launching "Terminal" you will connect to mustard by typing: ssh username@mustard.prism Where "username" is the username we sent you in the welcome email (incidentally it is also your CruzID username). It will ask you for a password; just type in the password we sent you in your account creation welcome email. When you type the password, the characters '''will not''' echo to the screen, so it will not show you what you are typing. Once you have logged in successfully to mustard, it will ask you to change your password.
It will ask for your current password one more time, then it will ask you to choose a new password, which you will need to enter two times. Again, whatever password you choose '''will not''' echo to the screen. Your new password must be: 1: At least 10 characters long 2: At least 3 character classes (lowercase, uppercase, number and/or special character) Once you change your password, it will log you out of mustard. '''Then, log out of the VPN''' (click the Tunnelblick icon on the top right of your screen and select "disconnect"). This step is very important. Then, log back into the VPN using your '''new''' password. It will send another Duo MFA push to your phone, then you should be logged in! Then feel free to ssh to any of our firewalled servers (using your new password). Note the following page for available resources: https://giwiki.gi.ucsc.edu/index.php?title=Firewalled_Computing_Resources_Overview As always, if you have any questions, please email '''cluster-admin@soe.ucsc.edu''' for help. 4ab124504b7dcfb75e4fe8a409ce2a77942e8205 Setting Up The VPN on Windows 0 68 606 601 2025-02-10T21:21:25Z Weiler 3 wikitext text/x-wiki '''Before''' following these instructions, please ensure that you have filled out an account request form and completed all the training and requirements as detailed here: [[Requirement_for_users_to_get_GI_VPN_access]] After completing those requirements, you should have received a welcome email from us explaining that your account is ready. Once you have received that email, continue following these instructions. We will be installing the OpenVPN Connect client for Windows. This VPN client currently supports '''Windows 10''' and '''Windows 11'''. Do not install this software on public or shared computers! Before installing our VPN profile, you must have enrolled your cell phone for Duo MFA using your CruzID account with UCSC. Most folks already have this from when they first started at UCSC.
If you don't yet have a CruzID, please contact your sponsor/PI and ask them to help you acquire a CruzID. If you have a CruzID but haven't yet enrolled your cell phone, please follow the instructions here to enroll your phone: https://its.ucsc.edu/mfa/enroll.html After confirming your cell phone MFA enrollment, or if you have already done this a while ago, continue to install our VPN profile. You will need to download our OpenVPN client configuration file from this link: https://giwiki.gi.ucsc.edu/downloads/prism.ovpn The username and password to access that web link will be sent to you in your account creation welcome email. Save the "prism.ovpn" file locally to your desktop or somewhere else convenient. After downloading the OpenVPN configuration file, you will need to download the OpenVPN Connect client from here: https://openvpn.net/community-downloads/ Download the '''Windows 64-bit MSI installer'''. Double-click on the Installer to begin installation, and follow the on-screen prompts to complete installation. Once the installation is complete, launch the '''OpenVPN Connect''' application. Review and agree to the '''Data Usage Policy'''. After opening the app, we will need to import the '''prism.ovpn''' file you downloaded earlier. Click '''File''' and browse to the location of your '''prism.ovpn''' file. Import the file. You should now be able to select the new profile and click connect. It should ask you for a username and password, which we will have sent you in our welcome email. After entering the username and password, you will receive a Duo Push on your phone in order to complete authentication. The OpenVPN status icon will appear in your system tray on the bottom right of your desktop. If it is yellow or red, that indicates you are not connected. If it is green, that indicates you are connected.
Once you authenticate to the VPN (username/password/MFA), then log in via SSH to 'mustard.prism' for example, and you will be asked to change your password. You can access SSH via several common Windows SSH clients, such as PuTTY. You can also use SSH right from a Command Prompt or PowerShell window (do a Windows search for "command" or "powershell" to find them). Once you have launched one of those applications, do: ssh username@mustard.prism Where "username" is the username we sent you in the welcome email (incidentally it is also your CruzID username). It will ask you for a password; just type in the password we sent you in your account creation welcome email. When you type the password, the characters '''will not''' echo to the screen, so it will not show you what you are typing. Once you have logged in successfully to mustard, it will ask you to change your password. It will ask for your current password one more time, then it will ask you to choose a new password, which you will need to enter two times. Again, whatever password you choose '''will not''' echo to the screen. Your new password must be: 1: At least 10 characters long 2: At least 3 character classes (lowercase, uppercase, number and/or special character) Once you change your password, it will log you out of mustard. Then, log out of the VPN (select "Disconnect" from the '''OpenVPN Connect''' application). Then, log back into the VPN using your '''new''' password. It will send another Duo MFA push to your phone, then you should be logged in! Then feel free to ssh to any of our firewalled servers (using your new password). Note the following page for available resources: https://giwiki.gi.ucsc.edu/index.php?title=Firewalled_Computing_Resources_Overview As always, if you have any questions, please email '''cluster-admin@soe.ucsc.edu''' for help.
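One hedged troubleshooting tip: if the profile import fails, the authenticated download may have saved an HTML error page instead of a real OpenVPN config. A genuine profile contains a standard OpenVPN "remote" directive, so a quick check like the following can tell the two apart (the helper name and the demo file are made up for illustration; check your actual downloaded .ovpn file the same way):

```shell
# Hypothetical sanity check: a real OpenVPN profile contains a "remote"
# directive, while a failed authenticated download often yields HTML.
looks_like_ovpn() {
  grep -q '^remote ' "$1" && ! grep -qi '<html' "$1"
}

# Demonstration against a stand-in file; "demo.ovpn" and the host are
# examples, not the real GI profile.
printf 'client\nremote vpn.example.org 1194\n' > demo.ovpn
looks_like_ovpn demo.ovpn && echo "looks like an OpenVPN profile"
```

If the check fails on the file you downloaded, re-download it using the username and password from your welcome email.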
1a45eebcb147564c92faf9177f9c9b2f9e8fc45f 624 606 2025-02-11T00:19:01Z Weiler 3 wikitext text/x-wiki '''Before''' following these instructions, please ensure that you have filled out an account request form and completed all the training and requirements as detailed here: [[Requirement_for_users_to_get_GI_VPN_access]] After completing those requirements, you should have received a welcome email from us explaining that your account is ready. Once you have received that email, continue following these instructions. We will be installing OpenVPN Connect client for Windows. This VPN client currently supports '''Windows 10''' and '''Windows 11'''. Do not install this software on public or shared computers! Before installing our VPN profile, you must have enrolled your cell phone for Duo MFA using your CruzID account with UCSC. Most folks already have this from when they first started at UCSC. If you don't yet have a CruzID, please contact your sponsor/PI and ask them to help you acquire a CruzID. If you have a CruzID but haven't yet enrolled your cell phone, please follow the instructions here to enroll your phone: https://its.ucsc.edu/mfa/enroll.html After confirming your cell phone MFA enrollment, or if you have already done this a while ago, continue to install our VPN profile. You will need to download our OpenVPN client configuration file from this link: https://giwiki.gi.ucsc.edu/downloads/prism.ovpn The username and password to access that web link will be sent to you in your account creation welcome email. Save the "prism.ovpn" file locally to your desktop or somewhere else convenient. After downloading the OpenVPN configuration file, you will need to download the OpenVPN Connect client from here: https://openvpn.net/community-downloads/ Download the '''Windows 64-bit MSI installer'''. Double click on the Installer to begin installation, and follow the on-screen prompts to complete installation. 
Once the installation is complete, launch the '''OpenVPN Connect''' application. Review and agree to the '''Data Usage Policy'''. After opening the app, we will need to import the '''prism.ovpn''' file you downloaded earlier. Click '''File''' and browse to the location of your '''prism.ovpn''' file. Import the file. You should now be able to select the new profile and click connect. It should ask you for a username and password, which we will have sent you in our welcome email. After entering the username and password, you will receive a Duo Push on your phone in order to complete authentication. The OpenVPN status icon will appear in your system tray on the bottom right of your desktop. If it is yellow or red, that indicates you are not connected. If it is green, that indicates you are connected. Once you authenticate to the VPN (username/password/MFA), then log in via SSH to 'mustard.prism' for example, and you will be asked to change your password. You can access SSH via several common Windows SSH clients, such as PuTTY. You can also use SSH right from a Command Prompt or PowerShell window (do a Windows search for "command" or "powershell" to find them). Once you have launched one of those applications, do: ssh username@mustard.prism Where "username" is the username we sent you in the welcome email (incidentally it is also your CruzID username). It will ask you for a password; just type in the password we sent you in your account creation welcome email. When you type the password, the characters '''will not''' echo to the screen, so it will not show you what you are typing. Once you have logged in successfully to mustard, it will ask you to change your password. It will ask for your current password one more time, then it will ask you to choose a new password, which you will need to enter two times. Again, whatever password you choose '''will not''' echo to the screen.
Your new password must be: 1: At least 10 characters long 2: At least 3 character classes (lowercase, uppercase, number and/or special character) Once you change your password, it will log you out of mustard. '''Then, log out of the VPN''' (select "Disconnect" from the '''OpenVPN Connect''' application). This step is very important! Then, log back into the VPN using your '''new''' password. It will send another Duo MFA push to your phone, then you should be logged in! Then feel free to ssh to any of our firewalled servers (using your new password). Note the following page for available resources: https://giwiki.gi.ucsc.edu/index.php?title=Firewalled_Computing_Resources_Overview As always, if you have any questions, please email '''cluster-admin@soe.ucsc.edu''' for help. 5c7f8bc783c46180eca5f789b6cdecbba3fd080e 633 624 2025-02-12T22:14:28Z Weiler 3 wikitext text/x-wiki '''Before''' following these instructions, please ensure that you have filled out an account request form and completed all the training and requirements as detailed here: [[Requirement_for_users_to_get_GI_VPN_access]] After completing those requirements, you should have received a welcome email from us explaining that your account is ready. Once you have received that email, continue following these instructions. We will be installing the OpenVPN Connect client for Windows. This VPN client currently supports '''Windows 10''' and '''Windows 11'''. Do not install this software on public or shared computers! Before installing our VPN profile, you must have enrolled your cell phone for Duo MFA using your CruzID account with UCSC. Most folks already have this from when they first started at UCSC. If you don't yet have a CruzID, please contact your sponsor/PI and ask them to help you acquire a CruzID.
If you have a CruzID but haven't yet enrolled your cell phone, please follow the instructions here to enroll your phone: https://its.ucsc.edu/mfa/enroll.html After confirming your cell phone MFA enrollment, or if you have already done this a while ago, continue to install our VPN profile. You will need to download our OpenVPN client configuration file from this link: https://giwiki.gi.ucsc.edu/downloads/prism.ovpn The username and password to access that web link will be sent to you in your account creation welcome email. Save the "prism.ovpn" file locally to your desktop or somewhere else convenient. After downloading the OpenVPN configuration file, you will need to download the OpenVPN Connect client from here: https://openvpn.net/client/ Download the '''Windows Installer'''. Double-click on the Installer to begin installation, and follow the on-screen prompts to complete installation. Once the installation is complete, launch the '''OpenVPN Connect''' application. Review and agree to the '''Data Usage Policy'''. After opening the app, we will need to import the '''prism.ovpn''' file you downloaded earlier. Click '''File''' and browse to the location of your '''prism.ovpn''' file. Import the file. You should now be able to select the new profile and click connect. It should ask you for a username and password, which we will have sent you in our welcome email. After entering the username and password, you will receive a Duo Push on your phone in order to complete authentication. The OpenVPN status icon will appear in your system tray on the bottom right of your desktop. If it is yellow or red, that indicates you are not connected. If it is green, that indicates you are connected. Once you authenticate to the VPN (username/password/MFA), then log in via SSH to 'mustard.prism' for example, and you will be asked to change your password. You can access SSH via several common Windows SSH clients, such as PuTTY.
You can also use SSH right from a Command Prompt or PowerShell window (do a Windows search for "command" or "powershell" to find them). Once you have launched one of those applications, do: ssh username@mustard.prism Where "username" is the username we sent you in the welcome email (incidentally it is also your CruzID username). It will ask you for a password, just type in the password we sent you in your account creation welcome email. When you type the password, the characters '''will not''' echo to the screen, so it will not show you what you are typing. Once you have logged in successfully to mustard, it will ask you to change your password. It will ask for your current password one more time, then it will ask you to choose a new password, which you will need to enter two times. Again, whatever password you choose '''will not''' echo to the screen. Your new password must be: 1: At least 10 characters long 2: At least 3 character classes (lowercase, uppercase, number and/or special character) Once you change your password, it will log you out of mustard. '''Then, log out of the VPN''' (select "Disconnect" from the '''OpenVPN Connect''' application). This step is very important! Then, log back into the VPN using your '''new''' password. It will send another Duo MFA push to your phone, then you should be logged in! Then feel free to ssh to any of our firewalled servers (using your new password). Note the following page for available resources: https://giwiki.gi.ucsc.edu/index.php?title=Firewalled_Computing_Resources_Overview As always, if you have any questions, please email '''cluster-admin@soe.ucsc.edu''' for help.
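As a quick way to sanity-check a candidate password against the rules above before you commit to it, here is a small illustrative shell sketch. The `check_pw` helper name is ours, not part of any GI tooling, and the real password policy enforced on mustard remains authoritative:

```shell
# Illustrative check of the stated rules: at least 10 characters and
# at least 3 of the 4 character classes. This only approximates the
# real policy enforced on mustard.
check_pw() {
  pw="$1"
  n=0
  case "$pw" in *[a-z]*) n=$((n+1)) ;; esac         # lowercase
  case "$pw" in *[A-Z]*) n=$((n+1)) ;; esac         # uppercase
  case "$pw" in *[0-9]*) n=$((n+1)) ;; esac         # number
  case "$pw" in *[!a-zA-Z0-9]*) n=$((n+1)) ;; esac  # special character
  [ "${#pw}" -ge 10 ] && [ "$n" -ge 3 ]
}

check_pw 'correct-horse-7' && echo "acceptable"   # lower + number + special
check_pw 'abcdefghij' || echo "too simple"        # one class only
```

Do not test your actual new password on a shared machine's command line; shell history can retain it.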
9294f5067167d5bc78f4565417fa2d8a24f852bb 641 633 2025-02-13T19:06:35Z Weiler 3 wikitext text/x-wiki '''Before''' following these instructions, please ensure that you have filled out an account request form and completed all the training and requirements as detailed here: [[Requirement_for_users_to_get_GI_VPN_access]] After completing those requirements, you should have received a welcome email from us explaining that your account is ready. Once you have received that email, continue following these instructions. We will be installing OpenVPN Connect client for Windows. This VPN client currently supports '''Windows 10''' and '''Windows 11'''. Do not install this software on public or shared computers! Before installing our VPN profile, you must have enrolled your cell phone for Duo MFA using your CruzID account with UCSC. Most folks already have this from when they first started at UCSC. If you don't yet have a CruzID, please contact your sponsor/PI and ask them to help you acquire a CruzID. If you have a CruzID but haven't yet enrolled your cell phone, please follow the instructions here to enroll your phone: https://its.ucsc.edu/mfa/enroll.html After confirming your cell phone MFA enrollment, or if you have already done this a while ago, continue to install our VPN profile. You will need to download our OpenVPN client configuration file from this link: https://giwiki.gi.ucsc.edu/downloads/prism.ovpn The username and password to access that web link will be sent to you in your account creation welcome email. Download that file by right-clicking on the link above and selecting "Save Link As...", and save it to your Desktop or some other area you will remember. After downloading the OpenVPN configuration file, you will need to download the OpenVPN Connect client from here: https://openvpn.net/client/ Download the '''Windows Installer'''. Double click on the Installer to begin installation, and follow the on-screen prompts to complete installation. 
Once the installation is complete, launch the '''OpenVPN Connect''' application. Review and agree to the '''Data Usage Policy'''. After opening the app, we will need to import the '''prism-duo.ovpn''' file you downloaded earlier. Click '''File''' and browse to the location of your '''prism-duo.ovpn''' file. Import the file. You should now be able to select the new profile and click connect. It should ask you for a username and password, which we will have sent you in our welcome email. After entering the username and password, you will receive a Duo Push on your phone in order to complete authentication. The OpenVPN status icon will appear in your system tray on the bottom right of your desktop. If it is yellow or red, that indicates you are not connected. If it is green, that indicates you are connected. Once you authenticate to the VPN (username/password/MFA), then login via SSH to 'mustard.prism' for example, and you will be asked to change your password. You can access SSH via several common Windows SSH clients such as PuTTY. You can also use SSH right from a Command Prompt or PowerShell window (do a Windows search for "command" or "powershell" to find them). Once you have launched one of those applications, do: ssh username@mustard.prism Where "username" is the username we sent you in the welcome email (incidentally it is also your CruzID username). It will ask you for a password, just type in the password we sent you in your account creation welcome email. When you type the password, the characters '''will not''' echo to the screen, so it will not show you what you are typing. Once you have logged in successfully to mustard, it will ask you to change your password. It will ask for your current password one more time, then it will ask you to choose a new password, which you will need to enter two times. Again, whatever password you choose '''will not''' echo to the screen.
Your new password must be: 1: At least 10 characters long 2: At least 3 character classes (lowercase, uppercase, number and/or special character) Once you change your password, it will log you out of mustard. '''Then, log out of the VPN''' (select "Disconnect" from the '''OpenVPN Connect''' application). This step is very important! Then, log back into the VPN using your '''new''' password. It will send another Duo MFA push to your phone, then you should be logged in! Then feel free to ssh to any of our firewalled servers (using your new password). Note the following page for available resources: https://giwiki.gi.ucsc.edu/index.php?title=Firewalled_Computing_Resources_Overview As always, if you have any questions, please email '''cluster-admin@soe.ucsc.edu''' for help. 580520e78f279d2f442ef260a9f499bdd09a05ce 649 641 2025-02-21T00:38:26Z Weiler 3 wikitext text/x-wiki '''Before''' following these instructions, please ensure that you have filled out an account request form and completed all the training and requirements as detailed here: [[Requirement_for_users_to_get_GI_VPN_access]] After completing those requirements, you should have received a welcome email from us explaining that your account is ready. Once you have received that email, continue following these instructions. We will be installing OpenVPN Connect client for Windows. This VPN client currently supports '''Windows 10''' and '''Windows 11'''. Do not install this software on public or shared computers! Before installing our VPN profile, you must have enrolled your cell phone for Duo MFA using your CruzID account with UCSC. Most folks already have this from when they first started at UCSC. If you don't yet have a CruzID, please contact your sponsor/PI and ask them to help you acquire a CruzID.
If you have a CruzID but haven't yet enrolled your cell phone, please follow the instructions here to enroll your phone: https://its.ucsc.edu/mfa/enroll.html After confirming your cell phone MFA enrollment, or if you have already done this a while ago, continue to install our VPN profile. You will need to download our OpenVPN client configuration file from this link: https://giwiki.gi.ucsc.edu/downloads/ The username and password to access that web link will be sent to you in your account creation welcome email. Download that '''prism.ovpn''' file by right-clicking on the link on the website above and selecting "Save Link As...", and save it to your Desktop or some other area you will remember. After downloading the OpenVPN configuration file, you will need to download the OpenVPN Connect client from here: https://openvpn.net/client/ Download the '''Windows Installer'''. Double click on the Installer to begin installation, and follow the on-screen prompts to complete installation. Once the installation is complete, launch the '''OpenVPN Connect''' application. Review and agree to the '''Data Usage Policy'''. After opening the app, we will need to import the '''prism-duo.ovpn''' file you downloaded earlier. Click '''File''' and browse to the location of your '''prism-duo.ovpn''' file. Import the file. You should now be able to select the new profile and click connect. It should ask you for a username and password, which we will have sent you in our welcome email. After entering the username and password, you will receive a Duo Push on your phone in order to complete authentication. The OpenVPN status icon will appear in your system tray on the bottom right of your desktop. If it is yellow or red, that indicates you are not connected. If it is green, that indicates you are connected. Once you authenticate to the VPN (username/password/MFA), then login via SSH to 'mustard.prism' for example, and you will be asked to change your password.
You can access SSH via several common Windows SSH clients such as PuTTY. You can also use SSH right from a Command Prompt or PowerShell window (do a Windows search for "command" or "powershell" to find them). Once you have launched one of those applications, do: ssh username@mustard.prism Where "username" is the username we sent you in the welcome email (incidentally it is also your CruzID username). It will ask you for a password, just type in the password we sent you in your account creation welcome email. When you type the password, the characters '''will not''' echo to the screen, so it will not show you what you are typing. Once you have logged in successfully to mustard, it will ask you to change your password. It will ask for your current password one more time, then it will ask you to choose a new password, which you will need to enter two times. Again, whatever password you choose '''will not''' echo to the screen. Your new password must be: 1: At least 10 characters long 2: At least 3 character classes (lowercase, uppercase, number and/or special character) Once you change your password, it will log you out of mustard. '''Then, log out of the VPN''' (select "Disconnect" from the '''OpenVPN Connect''' application). This step is very important! Then, log back into the VPN using your '''new''' password. It will send another Duo MFA push to your phone, then you should be logged in! Then feel free to ssh to any of our firewalled servers (using your new password). Note the following page for available resources: https://giwiki.gi.ucsc.edu/index.php?title=Firewalled_Computing_Resources_Overview As always, if you have any questions, please email '''cluster-admin@soe.ucsc.edu''' for help.
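The download-then-connect flow above can also be sketched entirely from a terminal, for readers who prefer the command line. This is an illustrative sketch only: "username" is the placeholder from the welcome email, the save location is an example, and the web password prompt expects the credentials from your account creation welcome email.

```shell
# Fetch the OpenVPN profile over HTTPS with basic auth
# (curl prompts for the web password from the welcome email):
curl -u username -o prism-duo.ovpn https://giwiki.gi.ucsc.edu/downloads/prism-duo.ovpn

# After importing the profile into OpenVPN Connect and connecting
# (username/password/Duo push), SSH in; the first login forces a
# password change:
ssh username@mustard.prism
```

These commands are interactive (web password, SSH password, Duo push), so run them by hand rather than in a script.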
0f0eae1b2929cfcde206da05fbcc1dd5d40f5f49 651 649 2025-02-21T00:41:22Z Weiler 3 wikitext text/x-wiki '''Before''' following these instructions, please ensure that you have filled out an account request form and completed all the training and requirements as detailed here: [[Requirement_for_users_to_get_GI_VPN_access]] After completing those requirements, you should have received a welcome email from us explaining that your account is ready. Once you have received that email, continue following these instructions. We will be installing OpenVPN Connect client for Windows. This VPN client currently supports '''Windows 10''' and '''Windows 11'''. Do not install this software on public or shared computers! Before installing our VPN profile, you must have enrolled your cell phone for Duo MFA using your CruzID account with UCSC. Most folks already have this from when they first started at UCSC. If you don't yet have a CruzID, please contact your sponsor/PI and ask them to help you acquire a CruzID. If you have a CruzID but haven't yet enrolled your cell phone, please follow the instructions here to enroll your phone: https://its.ucsc.edu/mfa/enroll.html After confirming your cell phone MFA enrollment, or if you have already done this a while ago, continue to install our VPN profile. You will need to download our OpenVPN client configuration file from this link: https://giwiki.gi.ucsc.edu/downloads/prism-duo.ovpn The username and password to access that web link will be sent to you in your account creation welcome email. Download the '''prism-duo.ovpn''' file by right-clicking on the link above and selecting "Save Link As...", and save it to your Desktop or some other area you will remember. After downloading the OpenVPN configuration file, you will need to download the OpenVPN Connect client from here: https://openvpn.net/client/ Download the '''Windows Installer'''. 
Double click on the Installer to begin installation, and follow the on-screen prompts to complete installation. Once the installation is complete, launch the '''OpenVPN Connect''' application. Review and agree to the '''Data Usage Policy'''. After opening the app, we will need to import the '''prism-duo.ovpn''' file you downloaded earlier. Click '''File''' and browse to the location of your '''prism-duo.ovpn''' file. Import the file. You should now be able to select the new profile and click connect. It should ask you for a username and password, which we will have sent you in our welcome email. After entering the username and password, you will receive a Duo Push on your phone in order to complete authentication. The OpenVPN status icon will appear in your system tray on the bottom right of your desktop. If it is yellow or red, that indicates you are not connected. If it is green, that indicates you are connected. Once you authenticate to the VPN (username/password/MFA), then login via SSH to 'mustard.prism' for example, and you will be asked to change your password. You can access SSH via several common Windows SSH clients such as PuTTY. You can also use SSH right from a Command Prompt or PowerShell window (do a Windows search for "command" or "powershell" to find them). Once you have launched one of those applications, do: ssh username@mustard.prism Where "username" is the username we sent you in the welcome email (incidentally it is also your CruzID username). It will ask you for a password, just type in the password we sent you in your account creation welcome email. When you type the password, the characters '''will not''' echo to the screen, so it will not show you what you are typing. Once you have logged in successfully to mustard, it will ask you to change your password. It will ask for your current password one more time, then it will ask you to choose a new password, which you will need to enter two times.
Again, whatever password you choose '''will not''' echo to the screen. Your new password must be: 1: At least 10 characters long 2: At least 3 character classes (lowercase, uppercase, number and/or special character) Once you change your password, it will log you out of mustard. '''Then, log out of the VPN''' (select "Disconnect" from the '''OpenVPN Connect''' application). This step is very important! Then, log back into the VPN using your '''new''' password. It will send another Duo MFA push to your phone, then you should be logged in! Then feel free to ssh to any of our firewalled servers (using your new password). Note the following page for available resources: https://giwiki.gi.ucsc.edu/index.php?title=Firewalled_Computing_Resources_Overview As always, if you have any questions, please email '''cluster-admin@soe.ucsc.edu''' for help. 09fa27c2a4a82bf7db76c6b56348aef0c201e65d Setting Up The VPN on Linux 0 62 607 602 2025-02-10T21:22:11Z Weiler 3 wikitext text/x-wiki '''Before''' following these instructions, please ensure that you have filled out an account request form and completed all the training and requirements as detailed here: [[Requirement_for_users_to_get_GI_VPN_access]] After completing those requirements, you should have received a welcome email from us explaining that your account is ready. Once you have received that email, continue following these instructions. Most Linux flavors support OpenVPN client software. While the installation process may vary from flavor to flavor, we will be describing the process to get you going for Ubuntu, which should work on most Ubuntu versions and other Ubuntu/Debian derivatives. Do not install this software on public or shared computers! Before installing our VPN profile, you must have enrolled your cell phone for Duo MFA using your CruzID account with UCSC. Most folks already have this from when they first started at UCSC.
If you don't yet have a CruzID, please contact your sponsor/PI and ask them to help you acquire a CruzID. If you have a CruzID but haven't yet enrolled your cell phone, please follow the instructions here to enroll your phone: https://its.ucsc.edu/mfa/enroll.html After confirming your cell phone MFA enrollment, or if you have already done this a while ago, continue to install our VPN profile. You will need to download our OpenVPN client configuration file from this link: https://giwiki.gi.ucsc.edu/downloads/prism-duo.ovpn The username and password to access that web link will be sent to you in your account creation welcome email. Save the "prism-duo.ovpn" file locally to your desktop or somewhere else convenient. We will be installing the Prism VPN profile via the Network Manager GUI interface. Open '''Network Manager''' from the '''Gnome Settings''' option, select the '''Network''' tab, and click on the '''VPN +''' symbol: [[File:Configuring_1.png|600px]] From the '''Add VPN''' window, click on the '''Import from file...''' option: [[File:Configuring_2.png|600px]] You must navigate to your .ovpn file (/path/to/your/prism-duo.ovpn) and click on the '''Open''' button: [[File:Configuring_3.png|600px]] Click on the '''Add''' button: [[File:Configuring_4.png|600px]] Finally, click the '''On/Off''' button to turn on the VPN: [[File:Configuring_5.png|600px]] Once you authenticate to the VPN (username/password/MFA), then login via SSH to 'mustard.prism' for example, and you will be asked to change your password. ssh username@mustard.prism Where "username" is the username we sent you in the welcome email (incidentally it is also your CruzID username). It will ask you for a password, just type in the password we sent you in your account creation welcome email. When you type the password, the characters '''will not''' echo to the screen, so it will not show you what you are typing. Once you have logged in successfully to mustard, it will ask you to change your password.
It will ask for your current password one more time, then it will ask you to choose a new password, which you will need to enter two times. Again, whatever password you choose '''will not''' echo to the screen. Your new password must be: 1: At least 10 characters long 2: At least 3 character classes (lowercase, uppercase, number and/or special character) Once you change your password, it will log you out of mustard. Then, log out of the VPN (toggle the '''On/Off''' button from the Network Manager GUI VPN interface). Then, log back into the VPN using your '''new''' password. It will send another Duo MFA push to your phone, then you should be logged in! Then feel free to ssh to any of our firewalled servers (using your new password). Note the following page for available resources: https://giwiki.gi.ucsc.edu/index.php?title=Firewalled_Computing_Resources_Overview As always, if you have any questions, please email '''cluster-admin@soe.ucsc.edu''' for help. 80863c381c21dc002319a473a1d106b846cf1053 625 607 2025-02-11T00:19:25Z Weiler 3 wikitext text/x-wiki '''Before''' following these instructions, please ensure that you have filled out an account request form and completed all the training and requirements as detailed here: [[Requirement_for_users_to_get_GI_VPN_access]] After completing those requirements, you should have received a welcome email from us explaining that your account is ready. Once you have received that email, continue following these instructions. Most Linux flavors support OpenVPN client software. While the installation process may vary from flavor to flavor, we will be describing the process to get you going for Ubuntu, which should work on most Ubuntu versions and other Ubuntu/Debian derivatives. Do not install this software on public or shared computers! Before installing our VPN profile, you must have enrolled your cell phone for Duo MFA using your CruzID account with UCSC. Most folks already have this from when they first started at UCSC.
If you don't yet have a CruzID, please contact your sponsor/PI and ask them to help you acquire a CruzID. If you have a CruzID but haven't yet enrolled your cell phone, please follow the instructions here to enroll your phone: https://its.ucsc.edu/mfa/enroll.html After confirming your cell phone MFA enrollment, or if you have already done this a while ago, continue to install our VPN profile. You will need to download our OpenVPN client configuration file from this link: https://giwiki.gi.ucsc.edu/downloads/prism-duo.ovpn The username and password to access that web link will be sent to you in your account creation welcome email. Save the "prism-duo.ovpn" file locally to your desktop or somewhere else convenient. We will be installing the Prism VPN profile via the Network Manager GUI interface. Open '''Network Manager''' from the '''Gnome Settings''' option, select the '''Network''' tab, and click on the '''VPN +''' symbol: [[File:Configuring_1.png|600px]] From the '''Add VPN''' window, click on the '''Import from file...''' option: [[File:Configuring_2.png|600px]] You must navigate to your .ovpn file (/path/to/your/prism-duo.ovpn) and click on the '''Open''' button: [[File:Configuring_3.png|600px]] Click on the '''Add''' button: [[File:Configuring_4.png|600px]] Finally, click the '''On/Off''' button to turn on the VPN: [[File:Configuring_5.png|600px]] Once you authenticate to the VPN (username/password/MFA), then login via SSH to 'mustard.prism' for example, and you will be asked to change your password. ssh username@mustard.prism Where "username" is the username we sent you in the welcome email (incidentally it is also your CruzID username). It will ask you for a password, just type in the password we sent you in your account creation welcome email. When you type the password, the characters '''will not''' echo to the screen, so it will not show you what you are typing. Once you have logged in successfully to mustard, it will ask you to change your password.
It will ask for your current password one more time, then it will ask you to choose a new password, which you will need to enter two times. Again, whatever password you choose '''will not''' echo to the screen. Your new password must be: 1: At least 10 characters long 2: At least 3 character classes (lowercase, uppercase, number and/or special character) Once you change your password, it will log you out of mustard. '''Then, log out of the VPN''' (toggle the '''On/Off''' button from the Network Manager GUI VPN interface). This step is very important! Then, log back into the VPN using your '''new''' password. It will send another Duo MFA push to your phone, then you should be logged in! Then feel free to ssh to any of our firewalled servers (using your new password). Note the following page for available resources: https://giwiki.gi.ucsc.edu/index.php?title=Firewalled_Computing_Resources_Overview As always, if you have any questions, please email '''cluster-admin@soe.ucsc.edu''' for help. 87e9f64c211703378a639260335fa48dbc9a398a 642 625 2025-02-13T19:06:59Z Weiler 3 wikitext text/x-wiki '''Before''' following these instructions, please ensure that you have filled out an account request form and completed all the training and requirements as detailed here: [[Requirement_for_users_to_get_GI_VPN_access]] After completing those requirements, you should have received a welcome email from us explaining that your account is ready. Once you have received that email, continue following these instructions. Most Linux flavors support OpenVPN client software. While the installation process may vary from flavor to flavor, we will be describing the process to get you going for Ubuntu, which should work on most Ubuntu versions and other Ubuntu/Debian derivatives. Do not install this software on public or shared computers! Before installing our VPN profile, you must have enrolled your cell phone for Duo MFA using your CruzID account with UCSC.
Most folks already have this from when they first started at UCSC. If you don't yet have a CruzID, please contact your sponsor/PI and ask them to help you acquire a CruzID. If you have a CruzID but haven't yet enrolled your cell phone, please follow the instructions here to enroll your phone: https://its.ucsc.edu/mfa/enroll.html After confirming your cell phone MFA enrollment, or if you have already done this a while ago, continue to install our VPN profile. You will need to download our OpenVPN client configuration file from this link: https://giwiki.gi.ucsc.edu/downloads/prism-duo.ovpn The username and password to access that web link will be sent to you in your account creation welcome email. Download that file by right-clicking on the link above and selecting "Save Link As...", and save it to your Desktop or some other area you will remember. We will be installing the Prism VPN profile via the Network Manager GUI interface. Open '''Network Manager''' from the '''Gnome Settings''' option, select the '''Network''' tab, and click on the '''VPN +''' symbol: [[File:Configuring_1.png|600px]] From the '''Add VPN''' window, click on the '''Import from file...''' option: [[File:Configuring_2.png|600px]] You must navigate to your .ovpn file (/path/to/your/prism-duo.ovpn) and click on the '''Open''' button: [[File:Configuring_3.png|600px]] Click on the '''Add''' button: [[File:Configuring_4.png|600px]] Finally, click the '''On/Off''' button to turn on the VPN: [[File:Configuring_5.png|600px]] Once you authenticate to the VPN (username/password/MFA), then login via SSH to 'mustard.prism' for example, and you will be asked to change your password. ssh username@mustard.prism Where "username" is the username we sent you in the welcome email (incidentally it is also your CruzID username). It will ask you for a password, just type in the password we sent you in your account creation welcome email.
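The Network Manager GUI import described above can also be done from a terminal. This is a sketch under the assumption that NetworkManager's `nmcli` tool and its OpenVPN plugin are installed; the package names are Ubuntu/Debian ones, and the file path and connection name are examples:

```shell
# Install the OpenVPN plugin for NetworkManager (Ubuntu/Debian):
sudo apt install network-manager-openvpn network-manager-openvpn-gnome

# Import the downloaded profile; the connection name defaults to the
# filename stem ("prism-duo"):
nmcli connection import type openvpn file ~/Desktop/prism-duo.ovpn

# Bring the VPN up; this prompts for username/password and then
# triggers the Duo push:
nmcli connection up prism-duo
```

Toggling the connection off (`nmcli connection down prism-duo`) is the command-line equivalent of the '''On/Off''' button in the GUI.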
When you type the password, the characters '''will not''' echo to the screen, so it will not show you what you are typing. Once you have logged in successfully to mustard, it will ask you to change your password. It will ask for your current password one more time, then it will ask you to choose a new password, which you will need to enter two times. Again, whatever password you choose '''will not''' echo to the screen. Your new password must be: 1: At least 10 characters long 2: At least 3 character classes (lowercase, uppercase, number and/or special character) Once you change your password, it will log you out of mustard. '''Then, log out of the VPN''' (toggle the '''On/Off''' button from the Network Manager GUI VPN interface). This step is very important! Then, log back into the VPN using your '''new''' password. It will send another Duo MFA push to your phone, then you should be logged in! Then feel free to ssh to any of our firewalled servers (using your new password). Note the following page for available resources: https://giwiki.gi.ucsc.edu/index.php?title=Firewalled_Computing_Resources_Overview As always, if you have any questions, please email '''cluster-admin@soe.ucsc.edu''' for help. 709528375ace175e349d47a6fd9fe74fec810cb0 Genomics Institute Computing Information 0 6 611 603 2025-02-10T22:33:37Z Weiler 3 wikitext text/x-wiki Welcome to the Genomics Institute Computing Information Repository! Browse the topics below for help in the area you are curious about.
== GI Public Computing Environment == *[[How to access the public servers]] == GI Firewalled Computing Environment (PRISM) == *[[Access to the Firewalled Compute Servers]] *[[Firewalled Computing Resources Overview]] *[[Firewalled Environment Storage Overview]] *[[Firewalled User Account and Storage Cost]] *[[Grafana Performance Metrics]] *[[Visual Studio Code (vscode) Configuration Tweaks]] *[http://logserv.gi.ucsc.edu/cgi-bin/private-groups.cgi '''/private/groups''' Data Usage Graphs] *[[Resetting your VPN/PRISM Password]] ==VPN Access== *[[Requirement for users to get GI VPN access]] *[[Setting Up The VPN on MacOS]] *[[Setting Up The VPN on Windows]] *[[Setting Up The VPN on Linux]] *[[Multi Factor Authentication (MFA) Frequently Asked Questions]] *[[Converting From Non-MFA VPN to the MFA-Enabled VPN on MacOS]] *[[Converting From Non-MFA VPN to the MFA-Enabled VPN on Windows]] *[[Converting From Non-MFA VPN to the MFA-Enabled VPN on Linux]] == NIH dbGaP Access Requirements == *[[Requirements for dbGaP Access]] == giCloud Openstack == *[[Overview of giCloud in the Genomics Institute]] *[[Quick Start Instructions to Get Rolling with OpenStack]] == Amazon Web Services Information == *[[Overview of Getting and Using an AWS IAM Account]] *[[AWS Account List and Numbers]] *[[AWS Shared Bucket Usage Graphs]] *[[AWS Best Practices]] *[[AWS S3 Lifecycle Management]] == Slurm at the Genomics Institute == *[[Overview of using Slurm]] *[[Cluster Etiquette]] *[[Annotated Slurm Script]] *[[Job Arrays]] *[[GPU Resources]] *[[Quick Reference Guide]] *[[Convenient Slurm Commands]] *[[Slurm Queues (Partitions) and Resource Management]] *[[Slurm Tips for vg]] *[[Slurm Tips for Toil]] *[[Using Docker under Slurm]] *[[Phoenix WDL Tutorial]] ==General Docker Information== *[[Running a Container as a non-root User]] == Problems or technical support == If you have any problems with the GI computing environment, please send an email to '''cluster-admin@soe.ucsc.edu''' 
d6ed41e0729a8499a6c36f4d8ddc7dbcb60cb62c 653 611 2025-03-15T15:25:56Z Weiler 3 /* VPN Access */ wikitext text/x-wiki Welcome to the Genomics Institute Computing Information Repository! Browse the topics below for help in the area you are curious about. == GI Public Computing Environment == *[[How to access the public servers]] == GI Firewalled Computing Environment (PRISM) == *[[Access to the Firewalled Compute Servers]] *[[Firewalled Computing Resources Overview]] *[[Firewalled Environment Storage Overview]] *[[Firewalled User Account and Storage Cost]] *[[Grafana Performance Metrics]] *[[Visual Studio Code (vscode) Configuration Tweaks]] *[http://logserv.gi.ucsc.edu/cgi-bin/private-groups.cgi '''/private/groups''' Data Usage Graphs] *[[Resetting your VPN/PRISM Password]] ==VPN Access== *[[Requirement for users to get GI VPN access]] *[[Setting Up The VPN on MacOS]] *[[Setting Up The VPN on Windows]] *[[Setting Up The VPN on Linux]] *[[Multi Factor Authentication (MFA) Frequently Asked Questions]] *[[Converting From Non-MFA VPN to the MFA-Enabled VPN on MacOS]] *[[Converting From Non-MFA VPN to the MFA-Enabled VPN on Windows]] *[[Converting From Non-MFA VPN to the MFA-Enabled VPN on Linux]] *[[Duo Pushes Aren't Being Sent to My Phone!]] == NIH dbGaP Access Requirements == *[[Requirements for dbGaP Access]] == giCloud Openstack == *[[Overview of giCloud in the Genomics Institute]] *[[Quick Start Instructions to Get Rolling with OpenStack]] == Amazon Web Services Information == *[[Overview of Getting and Using an AWS IAM Account]] *[[AWS Account List and Numbers]] *[[AWS Shared Bucket Usage Graphs]] *[[AWS Best Practices]] *[[AWS S3 Lifecycle Management]] == Slurm at the Genomics Institute == *[[Overview of using Slurm]] *[[Cluster Etiquette]] *[[Annotated Slurm Script]] *[[Job Arrays]] *[[GPU Resources]] *[[Quick Reference Guide]] *[[Convenient Slurm Commands]] *[[Slurm Queues (Partitions) and Resource Management]] *[[Slurm Tips for vg]] *[[Slurm Tips for Toil]]
*[[Using Docker under Slurm]]
*[[Phoenix WDL Tutorial]]

== General Docker Information ==
*[[Running a Container as a non-root User]]

== Problems or technical support ==
If you have any problems with the GI computing environment, please send an email to '''cluster-admin@soe.ucsc.edu'''

Converting From Non-MFA VPN to the MFA-Enabled VPN on MacOS

If you are using Tunnelblick on MacOS and you are looking to convert to the new MFA-enabled GI VPN, you have come to the right place. You must already have Duo set up with your CruzID (which most of you do). If for some reason you don't have Duo set up yet on your phone, go here to enroll a device and configure Push Notifications with Duo before continuing: https://its.ucsc.edu/mfa/enroll.html

OK! Let's get to it. Disconnect from the VPN if you are already connected. Then you will need to download the new OpenVPN config file from here: https://giwiki.gi.ucsc.edu/downloads/ The credentials to access that website are username: '''genecats''' and password: '''KiloKluster'''

Go to the link above, right-click on '''prism-duo.ovpn''', select "Save Link As...", and save it to your Desktop or some other location you will remember.

Then open Tunnelblick and click on the Tunnelblick icon on the top right of your screen next to the date. It looks a bit like a small tunnel. In the window that opens, select "VPN Details...". In the resulting window, select the "Configurations" tab on the top. You will see a list of Configurations on the left, and it should include the current configuration you use to connect. It may be called 'prism' or maybe 'client'.

Drag the new configuration called '''prism-duo.ovpn''' (from your Desktop) into the Configurations area beneath your old configuration. It should import the configuration. It will ask you if you want to install it for "Only You" or "All Users". Click "Only You". You will also be asked to type in your laptop password.

That's it! Select the new configuration on the left and click the "Connect" button on the bottom right. It will ask you for the usual GI PRISM username and password you use to connect to our VPN, then it will send a Duo Push notification to your phone, and you should be logged in. Other than the Duo Push, the VPN behaves exactly like it did before.

If you have issues you can always revert to the old configuration, which will still work for a while. We will disable the old VPN soon though, so make every effort to get the new VPN setup working. Once you have the new VPN working, feel free to delete the old profile from Tunnelblick by clicking on the old profile in the "Configurations" window, then clicking the '''"-"''' button below to remove the old configuration.

As always, please email '''cluster-admin@soe.ucsc.edu''' if you need help or have any questions.

Converting From Non-MFA VPN to the MFA-Enabled VPN on Linux

If you are using OpenVPN on Linux to connect to the GI VPN and you are looking to convert to the new MFA-enabled GI VPN, you have come to the right place. You must already have Duo set up with your CruzID (which most of you do).
If for some reason you don't have Duo set up yet on your phone, go here to enroll a device and configure Push Notifications with Duo before continuing: https://its.ucsc.edu/mfa/enroll.html OK! Let's get to it. Disconnect from the VPN if you are already connected. All the various flavors and versions of Linux vary in the specifics, so you may not be following these exact instructions to get it to work. This is based on the Network Manager in Ubuntu, but most Ubuntu/Debian variants will be similar. Then you will need to download the new OpenVPN config file from here: https://giwiki.gi.ucsc.edu/downloads/prism-duo.ovpn The credentials to access that website are username: '''genecats''' and password: '''KiloKluster''' Download that file to your Desktop or some other easy to remember location. We will be installing the Prism VPN profile via the Network Manager GUI interface. Open '''Network Manager''' from '''Gnome Settings''' option and select the '''Network''' tab and click on the '''VPN +''' symbol: [[File:Configuring_1.png|600px]] From the '''Add VPN''' window, click on the '''Import from file...''' option: [[File:Configuring_2.png|600px]] You must navigate to your .ovpn file (/path/to/your/prism-duo.ovpn) and click on '''Open''' button: [[File:Configuring_3.png|600px]] Click on the '''Add''' button: [[File:Configuring_4.png|600px]] Finally, click the '''On/Off''' button to start on the new VPN: [[File:Configuring_5.png|600px]] That's it! It will ask you for your usual GI PRISM username and password that you usually use to connect to our VPN, and after that it will send a Duo Push notification to your phone, and then you should be logged in. Other than the Duo Push, the VPN behaves exactly like it did before. If you have issues you can always revert back to the old configuration, which will still work for a while. We will disable the old VPN soon though, so make every effort to get the new VPN setup working. 
Once you have the new VPN working, feel free to delete the old profile from the Network Manager. As always, please email '''cluster-admin@soe.ucsc.edu''' if you need help or have any questions. 145ad699a30df69628bdd1a8e82f9df394dedb25 638 619 2025-02-13T19:05:22Z Weiler 3 wikitext text/x-wiki If you are using OpenVPN on Linux to connect to the GI VPN and you are looking to convert to the new MFA-enabled GI VPN, you have come to the right place. You must already have Duo set up with your CruzID (which most of you do). If for some reason you don't have Duo set up yet on your phone, go here to enroll a device and configure Push Notifications with Duo before continuing: https://its.ucsc.edu/mfa/enroll.html OK! Let's get to it. Disconnect from the VPN if you are already connected. Linux flavors and versions vary in the specifics, so these exact instructions may not match your system; this is based on the Network Manager in Ubuntu, but most Ubuntu/Debian variants will be similar. Then you will need to download the new OpenVPN config file from here: https://giwiki.gi.ucsc.edu/downloads/prism-duo.ovpn The credentials to access that website are username: '''genecats''' and password: '''KiloKluster''' Download that file by right-clicking on the link above and selecting "Save Link As...", and save it to your Desktop or some other easy-to-remember location. We will be installing the Prism VPN profile via the Network Manager GUI interface.
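If you are on a machine without a desktop environment, or you simply prefer the command line, NetworkManager's <code>nmcli</code> tool can import the same profile. This is a sketch, not part of the official instructions: it assumes the profile was saved as ~/prism-duo.ovpn, that the NetworkManager OpenVPN plugin (the '''network-manager-openvpn''' package on Ubuntu/Debian) is installed, and that the import keeps the default connection name '''prism-duo''' (the filename minus its extension):

```shell
# Import the OpenVPN profile into NetworkManager (same result as the GUI steps below).
nmcli connection import type openvpn file ~/prism-duo.ovpn

# Bring the VPN up; --ask prompts for your GI PRISM username and password,
# after which the Duo Push is sent to your phone as usual.
nmcli --ask connection up prism-duo
```

Either way, the resulting connection behaves the same as one added through the GUI.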
Open '''Network Manager''' from the '''Gnome Settings''' option, select the '''Network''' tab, and click on the '''VPN +''' symbol: [[File:Configuring_1.png|600px]] From the '''Add VPN''' window, click on the '''Import from file...''' option: [[File:Configuring_2.png|600px]] Navigate to your .ovpn file (/path/to/your/prism-duo.ovpn) and click the '''Open''' button: [[File:Configuring_3.png|600px]] Click on the '''Add''' button: [[File:Configuring_4.png|600px]] Finally, click the '''On/Off''' button to start the new VPN: [[File:Configuring_5.png|600px]] That's it! It will ask for the GI PRISM username and password you usually use to connect to our VPN, then send a Duo Push notification to your phone, and you should be logged in. Other than the Duo Push, the VPN behaves exactly as it did before. If you have issues you can always revert to the old configuration, which will still work for a while. We will disable the old VPN soon, though, so make every effort to get the new VPN setup working. Once you have the new VPN working, feel free to delete the old profile from the Network Manager. As always, please email '''cluster-admin@soe.ucsc.edu''' if you need help or have any questions. e6b5466d91769705601d6c0ed2ad8ba6fccbd128 Converting From Non-MFA VPN to the MFA-Enabled VPN on Windows 0 72 620 2025-02-10T23:54:55Z Weiler 3 Created page with "If you are using OpenVPN Connect on Windows 10 or 11 to connect to the GI VPN, and you are looking to convert to the new MFA-enabled GI VPN, you have come to the right place. You must already have Duo set up with your CruzID (which most of you do). If for some reason you don't have Duo set up yet on your phone, go here to enroll a device and configure Push Notifications with Duo before continuing: https://its.ucsc.edu/mfa/enroll.html OK! Let's get to it. Disconnec..."
wikitext text/x-wiki If you are using OpenVPN Connect on Windows 10 or 11 to connect to the GI VPN, and you are looking to convert to the new MFA-enabled GI VPN, you have come to the right place. You must already have Duo set up with your CruzID (which most of you do). If for some reason you don't have Duo set up yet on your phone, go here to enroll a device and configure Push Notifications with Duo before continuing: https://its.ucsc.edu/mfa/enroll.html OK! Let's get to it. Disconnect from the VPN if you are already connected. Then you will need to download the new OpenVPN config file from here: https://giwiki.gi.ucsc.edu/downloads/prism-duo.ovpn The credentials to access that website are username: '''genecats''' and password: '''KiloKluster''' Download that file to your Desktop. Launch the '''OpenVPN Connect''' app (usually there is an icon for it on your Desktop, but you can search for it if not). It will launch and appear in your system tray on the bottom right (the system tray icon kind of looks like a '''^''' icon). You should see the OpenVPN icon there; it looks like a little computer screen with a lock on it. Right click on the OpenVPN icon in the system tray, and you should see a small menu appear. Select "Import file". In the resulting window, browse to your Desktop or wherever you saved the '''prism-duo.ovpn''' file. Select that file and click "Open". Once you import the file, you should be able to click on OpenVPN Connect again in the system tray and click "Connect". It should show multiple profiles, one for your old profile and one for your new profile. Select the new one. That's it! It will ask you for your usual GI PRISM username and password that you usually use to connect to our VPN, and after that it will send a Duo Push notification to your phone, and then you should be logged in. Other than the Duo Push, the VPN behaves exactly like it did before.
If you have issues you can always revert back to the old configuration, which will still work for a while. We will disable the old VPN soon though, so make every effort to get the new VPN setup working. As always, please email '''cluster-admin@soe.ucsc.edu''' if you need help or have any questions. d8f39724494e534f87d387d5d7d7cfd421e4c348 621 620 2025-02-11T00:04:16Z Weiler 3 wikitext text/x-wiki If you are using OpenVPN Connect on Windows 10 or 11 to connect to the GI VPN, and you are looking to convert to the new MFA-enabled GI VPN, you have come to the right place. You must already have Duo set up with your CruzID (which most of you do). If for some reason you don't have Duo set up yet on your phone, go here to enroll a device and configure Push Notifications with Duo before continuing: https://its.ucsc.edu/mfa/enroll.html OK! Let's get to it. Disconnect from the VPN if you are already connected. Then you will need to download the new OpenVPN config file from here: https://giwiki.gi.ucsc.edu/downloads/prism-duo.ovpn The credentials to access that website are username: '''genecats''' and password: '''KiloKluster''' Download that file to your Desktop. Launch the '''OpenVPN Connect''' app (usually there is an icon for it on your Desktop, but you can search for it if not). It will launch and appear in your system tray on the bottom right (the system tray icon kind of looks like a '''^''' icon). You should see the OpenVPN icon there; it looks like a little computer screen with a lock on it. Right click on the OpenVPN icon in the system tray, and you should see a small menu appear. Select "Import file". In the resulting window, browse to your Desktop or wherever you saved the '''prism-duo.ovpn''' file. Select that file and click "Open". Once you import the file, you should be able to right click on OpenVPN Connect again in the system tray and select the profile you want to connect to.
It should show multiple profiles, one for your old profile and one for your new profile. Select the new one, then select "Connect". That's it! It will ask you for your usual GI PRISM username and password that you usually use to connect to our VPN, and after that it will send a Duo Push notification to your phone, and then you should be logged in. Other than the Duo Push, the VPN behaves exactly like it did before. If you have issues you can always revert back to the old configuration, which will still work for a while. We will disable the old VPN soon though, so make every effort to get the new VPN setup working. As always, please email '''cluster-admin@soe.ucsc.edu''' if you need help or have any questions. e8c62ede9aff7143008747152172677276743dd3 622 621 2025-02-11T00:04:53Z Weiler 3 wikitext text/x-wiki If you are using OpenVPN Connect on Windows 10 or 11 to connect to the GI VPN, and you are looking to convert to the new MFA-enabled GI VPN, you have come to the right place. You must already have Duo set up with your CruzID (which most of you do). If for some reason you don't have Duo set up yet on your phone, go here to enroll a device and configure Push Notifications with Duo before continuing: https://its.ucsc.edu/mfa/enroll.html OK! Let's get to it. Disconnect from the VPN if you are already connected. Then you will need to download the new OpenVPN config file from here: https://giwiki.gi.ucsc.edu/downloads/prism-duo.ovpn The credentials to access that website are username: '''genecats''' and password: '''KiloKluster''' Download that file to your Desktop. Launch the '''OpenVPN Connect''' app (usually there is an icon for it on your Desktop, but you can search for it if not). It will launch and appear in your system tray on the bottom right (the system tray icon kind of looks like a '''^''' icon). You should see the OpenVPN icon there; it looks like a little computer screen with a lock on it.
Right click on the OpenVPN icon in the system tray, and you should see a small menu appear. Select "Import file". In the resulting window, browse to your Desktop or wherever you saved the '''prism-duo.ovpn''' file. Select that file and click "Open". Once you import the file, you should be able to right click on OpenVPN Connect again in the system tray and select the profile you want to connect to. It should show multiple profiles, one for your old profile and one for your new profile. Select the new one, then select "Connect". That's it! It will ask you for your usual GI PRISM username and password that you usually use to connect to our VPN, and after that it will send a Duo Push notification to your phone, and then you should be logged in. Other than the Duo Push, the VPN behaves exactly like it did before. If you have issues you can always revert back to the old configuration, which will still work for a while. We will disable the old VPN soon though, so make every effort to get the new VPN setup working. As always, please email '''cluster-admin@soe.ucsc.edu''' if you need help or have any questions. ecabb1a8d9c504cfd11057c1b009274ed03c6d21 634 622 2025-02-12T22:17:17Z Weiler 3 wikitext text/x-wiki If you are using OpenVPN Connect on Windows 10 or 11 to connect to the GI VPN, and you are looking to convert to the new MFA-enabled GI VPN, you have come to the right place. You must already have Duo set up with your CruzID (which most of you do). If for some reason you don't have Duo set up yet on your phone, go here to enroll a device and configure Push Notifications with Duo before continuing: https://its.ucsc.edu/mfa/enroll.html OK! Let's get to it. Disconnect from the VPN if you are already connected. Then you will need to download the new OpenVPN config file from here: https://giwiki.gi.ucsc.edu/downloads/prism-duo.ovpn The credentials to access that website are username: '''genecats''' and password: '''KiloKluster''' Download that file to your Desktop.
Launch the '''OpenVPN GUI''' app (usually there is an icon for it on your Desktop, but you can search for it if not). It will launch and appear in your system tray on the bottom right (the system tray icon kind of looks like a '''^''' icon). You should see the OpenVPN icon there; it looks like a little computer screen with a lock on it. Right click on the OpenVPN icon in the system tray, and you should see a small menu appear. Select "Import file". In the resulting window, browse to your Desktop or wherever you saved the '''prism-duo.ovpn''' file. Select that file and click "Open". Once you import the file, you should be able to right click on the OpenVPN icon again in the system tray and select the profile you want to connect to. It should show multiple profiles, one for your old profile and one for your new profile. Select the new one, then select "Connect". That's it! It will ask you for your usual GI PRISM username and password that you usually use to connect to our VPN, and after that it will send a Duo Push notification to your phone, and then you should be logged in. Other than the Duo Push, the VPN behaves exactly like it did before. If you have issues you can always revert back to the old configuration, which will still work for a while. We will disable the old VPN soon though, so make every effort to get the new VPN setup working. As always, please email '''cluster-admin@soe.ucsc.edu''' if you need help or have any questions. b9f686249fa31e3340331eb0076df472a2bfdba1 637 634 2025-02-13T19:05:00Z Weiler 3 wikitext text/x-wiki If you are using OpenVPN Connect on Windows 10 or 11 to connect to the GI VPN, and you are looking to convert to the new MFA-enabled GI VPN, you have come to the right place. You must already have Duo set up with your CruzID (which most of you do). If for some reason you don't have Duo set up yet on your phone, go here to enroll a device and configure Push Notifications with Duo before continuing: https://its.ucsc.edu/mfa/enroll.html OK!
Let's get to it. Disconnect from the VPN if you are already connected. Then you will need to download the new OpenVPN config file from here: https://giwiki.gi.ucsc.edu/downloads/prism-duo.ovpn The credentials to access that website are username: '''genecats''' and password: '''KiloKluster''' Download that file by right-clicking on the link above and selecting "Save Link As...", and save it to your Desktop or some other area you will remember. Launch the '''OpenVPN GUI''' app (usually there is an icon for it on your Desktop, but you can search for it if not). It will launch and appear in your system tray on the bottom right (the system tray icon kind of looks like a '''^''' icon). You should see the OpenVPN icon there; it looks like a little computer screen with a lock on it. Right click on the OpenVPN icon in the system tray, and you should see a small menu appear. Select "Import file". In the resulting window, browse to your Desktop or wherever you saved the '''prism-duo.ovpn''' file. Select that file and click "Open". Once you import the file, you should be able to right click on the OpenVPN icon again in the system tray and select the profile you want to connect to. It should show multiple profiles, one for your old profile and one for your new profile. Select the new one, then select "Connect". That's it! It will ask for the GI PRISM username and password you usually use to connect to our VPN, then send a Duo Push notification to your phone, and you should be logged in. Other than the Duo Push, the VPN behaves exactly as it did before. If you have issues you can always revert to the old configuration, which will still work for a while. We will disable the old VPN soon, though, so make every effort to get the new VPN setup working. As always, please email '''cluster-admin@soe.ucsc.edu''' if you need help or have any questions.
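If the import fails, one common cause is that the download saved an HTML login or error page instead of the actual profile (for example, if the genecats credentials were mistyped). A quick sanity check is to look for standard OpenVPN client directives such as <code>remote</code> and <code>auth-user-pass</code> in the file. The sketch below uses a stand-in file, since the real prism-duo.ovpn contents aren't reproduced here; run the same <code>grep</code> checks against your downloaded file:

```shell
# Stand-in profile for illustration; check your downloaded prism-duo.ovpn instead.
cat > sample.ovpn <<'EOF'
client
remote vpn.example.org 1194 udp
auth-user-pass
EOF

# A real OpenVPN client profile names a server and asks for credentials;
# an HTML error page will contain neither directive.
if grep -q '^remote ' sample.ovpn && grep -q '^auth-user-pass' sample.ovpn; then
    echo "looks like an OpenVPN client profile"
else
    echo "unexpected contents - re-download the file"
fi
```

If the checks fail on your download, fetch the file again and double-check the genecats credentials.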
af0034d9601d2f54f67eb1c421666c07476429d1 Slurm Tips for Toil 0 38 626 470 2025-02-11T20:42:22Z Anovak 4 Add quotes to protect brackets wikitext text/x-wiki Here are some tips for running Toil workflows on the Phoenix Slurm cluster. Mostly you might want to run WDL workflows, but you can use some of these for other workflows like Cactus. You can also consult [https://github.com/DataBiosphere/toil/blob/master/docs/wdl/running.rst the Toil documentation on WDL workflows]. * Install Toil with WDL support with: pip3 install --upgrade 'toil[wdl]' To use a development version of Toil, you can install from source instead: pip3 install 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl]' Or for a particular branch: pip3 install 'git+https://github.com/DataBiosphere/toil.git@issues/123-abc#egg=toil[wdl]' * You will then need to make sure your '''~/.local/bin''' directory is on your PATH. Open up your '''~/.bashrc''' file and add: export PATH=$PATH:$HOME/.local/bin Then make sure to log out and back in again. * For Toil options, you will want '''--batchSystem slurm''' to make it use Slurm and '''--batchLogsDir ./logs''' (or some other location on a shared filesystem) for the Slurm logs to not get lost. * You may be able to speed up your workflow with '''--caching true''', to cache data on nodes to be shared among multiple simultaneous tasks. * If using '''toil-wdl-runner''', you might want to add '''--jobStore ./jobStore''' to make sure the job store is in a defined, shared location so that you can use '''--restart''' later. * If using '''toil-wdl-runner''', you will want to set the '''SINGULARITY_CACHEDIR''' and '''MINIWDL__SINGULARITY__IMAGE_CACHE''' environment variables for your workflow to locations on shared storage, and possibly to the default cache locations in your home directory. Otherwise Toil will set them to node-local directories for each node, and thus re-download images for each workflow run, and for each cluster node. 
To avoid this, before your run or in your '''~/.bashrc''', you could, for example: export SINGULARITY_CACHEDIR=$HOME/.singularity/cache export MINIWDL__SINGULARITY__IMAGE_CACHE=$HOME/.cache/miniwdl aec63aec38c0e471a39f3519d9eebfaf81778cf1 627 626 2025-02-11T20:49:14Z Anovak 4 Change to new extras syntax from https://github.com/pypa/pip/pull/11617 wikitext text/x-wiki Here are some tips for running Toil workflows on the Phoenix Slurm cluster. Mostly you might want to run WDL workflows, but you can use some of these for other workflows like Cactus. You can also consult [https://github.com/DataBiosphere/toil/blob/master/docs/wdl/running.rst the Toil documentation on WDL workflows]. * Install Toil with WDL support with: pip3 install --upgrade 'toil[wdl]' To use a development version of Toil, you can install from source instead: pip3 install 'toil[wdl]@git+https://github.com/DataBiosphere/toil.git' Or for a particular branch: pip3 install 'toil[wdl]@git+https://github.com/DataBiosphere/toil.git@issues/123-abc' * You will then need to make sure your '''~/.local/bin''' directory is on your PATH. Open up your '''~/.bashrc''' file and add: export PATH=$PATH:$HOME/.local/bin Then make sure to log out and back in again. * For Toil options, you will want '''--batchSystem slurm''' to make it use Slurm and '''--batchLogsDir ./logs''' (or some other location on a shared filesystem) for the Slurm logs to not get lost. * You may be able to speed up your workflow with '''--caching true''', to cache data on nodes to be shared among multiple simultaneous tasks. * If using '''toil-wdl-runner''', you might want to add '''--jobStore ./jobStore''' to make sure the job store is in a defined, shared location so that you can use '''--restart''' later.
* If using '''toil-wdl-runner''', you will want to set the '''SINGULARITY_CACHEDIR''' and '''MINIWDL__SINGULARITY__IMAGE_CACHE''' environment variables for your workflow to locations on shared storage, and possibly to the default cache locations in your home directory. Otherwise Toil will set them to node-local directories for each node, and thus re-download images for each workflow run, and for each cluster node. To avoid this, before your run or in your '''~/.bashrc''', you could, for example: export SINGULARITY_CACHEDIR=$HOME/.singularity/cache export MINIWDL__SINGULARITY__IMAGE_CACHE=$HOME/.cache/miniwdl 351661bba433769fe22065f3ff4bc2a185ba0723 646 627 2025-02-14T19:12:39Z Anovak 4 wikitext text/x-wiki Here are some tips for running Toil workflows on the Phoenix Slurm cluster. Mostly you might want to run WDL workflows, but you can use some of these for other workflows like Cactus. You can also consult [https://github.com/DataBiosphere/toil/blob/master/docs/wdl/running.rst the Toil documentation on WDL workflows]. * Install Toil with WDL support with: pipx install 'toil[wdl]' To use a development version of Toil, you can install from source instead: pipx install 'toil[wdl]@git+https://github.com/DataBiosphere/toil.git' Or for a particular branch: pipx install 'toil[wdl]@git+https://github.com/DataBiosphere/toil.git@issues/123-abc' If you don't have <code>pipx</code>, you would first need to: python3 -m pip install --user pipx python3 -m pipx ensurepath This may in turn require you to log out and back in. * For Toil options, you will want '''--batchSystem slurm''' to make it use Slurm and '''--batchLogsDir ./logs''' (or some other location on a shared filesystem) for the Slurm logs to not get lost. * You may be able to speed up your workflow with '''--caching true''', to cache data on nodes to be shared among multiple simultaneous tasks.
* If using '''toil-wdl-runner''', you might want to add '''--jobStore ./jobStore''' to make sure the job store is in a defined, shared location so that you can use '''--restart''' later. * If using '''toil-wdl-runner''', you will want to set the '''SINGULARITY_CACHEDIR''' and '''MINIWDL__SINGULARITY__IMAGE_CACHE''' environment variables for your workflow to locations on shared storage, and possibly to the default cache locations in your home directory. Otherwise Toil will set them to node-local directories for each node, and thus re-download images for each workflow run, and for each cluster node. To avoid this, before your run or in your '''~/.bashrc''', you could, for example: export SINGULARITY_CACHEDIR=$HOME/.singularity/cache export MINIWDL__SINGULARITY__IMAGE_CACHE=$HOME/.cache/miniwdl 1fff1e93ec5f7c131e8fe0dee624b9f5a445910b Phoenix WDL Tutorial 0 45 643 509 2025-02-14T19:00:28Z Anovak 4 /* Installing Toil with WDL support */ wikitext text/x-wiki '''Tutorial: Getting Started with WDL Workflows on Phoenix''' Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments.
By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong. =Phoenix Cluster Setup= Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH. ==Getting VPN access== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ==Connecting to Phoenix== Once you have VPN access, you can connect to any of the machines with access to the Phoenix cluster. These interactive nodes are fairly large machines that can do some work locally, but you will still want to run larger workflows on the actual cluster. For this tutorial, we will use <code>emerald.prism</code> as our login node. To connect to the cluster: 1. Connect to the VPN. 2. SSH to <code>emerald.prism</code>. At the command line, run: ssh emerald.prism If your username on the cluster (say, <code>flastname</code>) is different than your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@emerald.prism The first time you connect, you will see a message like: The authenticity of host 'emerald.prism (10.50.1.67)' can't be established. ED25519 key fingerprint is SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI. This key is not known by any other names. 
Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>emerald.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is. ==Installing Toil with WDL support== Once you are on the head node, you can install Toil, a program for running workflows. Toil is written in Python, and the modern way to install Python command line tools is with pipx. So [https://pipx.pypa.io/latest/installation/ install pipx]: python3 -m pip install --user pipx python3 -m pipx ensurepath When installing, you need to specify that you want WDL support. To do this, you can run: pipx install 'toil[wdl]' If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will also need to install Toil with the <code>aws</code> and <code>google</code> extras, respectively: pipx install 'toil[wdl,aws,google]' This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. The <code>python3 -m pipx ensurepath</code> command should have added the <code>~/.local/bin</code> directory to your <code>PATH</code> environment variable, to ensure you can find these commands. After that, '''log out and log back in''', to restart bash and pick up the change.
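You can also check whether the PATH change has taken effect without logging out. This is a small sketch; the <code>on_path</code> helper is just for illustration and not part of pipx, and it assumes pipx's default Linux install directory of ~/.local/bin:

```shell
# Helper: succeed if directory $1 appears as a component of search path $2.
on_path() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# pipx on Linux installs console scripts into ~/.local/bin by default.
if on_path "$HOME/.local/bin" "$PATH"; then
    echo "~/.local/bin is on PATH; toil-wdl-runner should be found"
else
    echo "not on PATH yet; log out and back in to pick up the change"
fi
```

The leading and trailing colons in the <code>case</code> pattern ensure we match whole PATH components, so a directory like /home/user/.local/bin-extra cannot match by accident.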
To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can run <code>pipx upgrade toil</code>. ==Configuring your Phoenix Environment== '''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under <code>/private/groups</code>. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under <code>/private/groups</code>. Usually you would end up with <code>/private/groups/YOURGROUPNAME/YOURUSERNAME</code>. Remember this path; we will need it later. ==Configuring Toil for Phoenix== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these files can be large, and the home directory quota is only 30 GB, you might not be able to keep them in your home directory. We would like to be able to store these on the cluster's large storage array, under <code>/private/groups</code>. However, Toil needs to use file locks in these directories to prevent simultaneous Singularity calls from producing internal Singularity errors, and Ceph currently has [https://tracker.ceph.com/issues/65607 a bug where these file locking operations can freeze the Ceph servers]. If you have '''a small number of container images''' that will fit in your home directory, you can keep them there.
[https://github.com/DataBiosphere/toil/commit/cb0b291bb7f6212bfe69221dd9f09d72f83e92fb Since Toil 6.1.0], this is the default behavior and you don't need to do anything. (Unless you previously set <code>SINGULARITY_CACHEDIR</code> or <code>MINIWDL__SINGULARITY__IMAGE_CACHE</code>, in which case you need to unset them.) '''If you don't have room in your home directory''' for container images, currently the recommended approach is to use node-local storage under <code>/data/tmp</code>. This results in each node pulling each container image, but images will be saved across workflows. You can set that up for all your workflows with: echo 'export SINGULARITY_CACHEDIR="/data/tmp/$(whoami)/cache/singularity"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="/data/tmp/$(whoami)/cache/miniwdl"' >>~/.bashrc Then '''log out and log back in again''', to apply the changes. =Running an existing workflow= First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, go to your user directory under <code>/private/groups</code>, and make a directory to work in. cd /private/groups/YOURGROUPNAME/YOURUSERNAME mkdir workflow-test cd workflow-test Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ==Preparing an input file== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. 
For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code> echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ==Testing at small scale single-machine== We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. 
This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ==Running at larger scale== Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people! Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create, where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the <code>--caching false</code> option. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it.
Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the <code>TOIL_SLURM_ARGS</code> environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in. mkdir -p logs export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium" toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. =Writing your own workflow= In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ==Writing the file== ===Version=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 ===Workflow Block=== Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } ===Input Block=== Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? 
fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run. ===Body=== Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. ===Scattering=== Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } ===Conditionals=== Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. 
But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else, you will have to check the negated condition. So first, let's handle the special cases. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } ===Calling Tasks=== Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, at least for the numbers where we didn't make a noise instead. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number.
call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } ===Writing Tasks=== Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill that in with <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.
task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage.
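As an aside, the capture pattern above has a close shell analogue, which may help build intuition. Here is a minimal sketch (plain shell, not WDL; the literal <code>42</code> stands in for the <code>~{the_number}</code> substitution):

```shell
# Rough shell analogue of the task: run the command, capture its standard
# output, and strip the trailing newline, like WDL's read_string(stdout()).
the_number=42
the_string="$(echo "$the_number")"  # $(...) also trims trailing newlines
echo "captured: $the_string"
```

The analogy is loose: in the real task, the command runs inside the container and Toil manages the captured stdout as a <code>File</code>.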
Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } ===Output Block=== Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make?
Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as an array. ==Running the Workflow== Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium" toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json =Debugging Workflows= Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.
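One low-tech idea for this tutorial's workflow is to compare the strings in the output JSON against an independent implementation of the same logic. Here is a plain-shell sketch (ordinary shell, not WDL; it hardcodes the default <code>to_fizz = 3</code> and <code>to_buzz = 5</code>, ignores <code>fizzbuzz_override</code>, and folds <code>select_first()</code> into string concatenation):

```shell
# Plain-shell FizzBuzz mirroring the workflow's logic, for comparison with
# the fizzbuzz_results array. Not WDL; defaults to_fizz=3, to_buzz=5 assumed.
fizzbuzz() {
  n=$1
  fizz=""
  buzz=""
  if [ $((n % 3)) -eq 0 ]; then fizz="Fizz"; fi
  if [ $((n % 5)) -eq 0 ]; then buzz="Buzz"; fi
  # Concatenation plays the role of select_first([fizzbuzz, fizz, buzz, ...]).
  if [ -n "$fizz$buzz" ]; then
    echo "$fizz$buzz"
  else
    echo "$n"  # what stringify_number would produce
  fi
}
# Print the expected results for item_count = 20.
i=1
while [ "$i" -le 20 ]; do
  fizzbuzz "$i"
  i=$((i + 1))
done
```

If a run's <code>fizzbuzz_results</code> disagrees with this list, the problem is in the workflow or its inputs rather than in the cluster setup.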
==Restarting the Workflow== If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with <code>--jobStore</code> set manually to a directory that persists between attempts, you can add <code>--restart</code> to your workflow command and make Toil try again. It will pick up from where it left off and rerun any failed tasks and then the rest of the workflow. This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart. If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques. ==Debugging Options== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place where you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this. ==Reading the Log== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen either when the command line command is written wrong or when the error detection code in the tool you are trying to run detects and reports an error.
Go up higher in the log until you find lines that look like: [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows: And [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows: These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. If you would like individual task logs to be saved separately for later reference, you can use the <code>--writeLogs</code> option to specify a directory to store them. For more information, see [https://toil.readthedocs.io/en/latest/wdl/running.html#managing-workflow-logs the Toil documentation of workflow task logs]. ==Reproducing Problems== When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. === Automatically Fetching Input Files === The <code>toil debug-job</code> command has a <code>--retrieveTaskDirectory</code> option that lets you dump out a directory with all the files that a failing WDL task would use. You can use it like: toil debug-job ./jobstore WDLTaskJob --retrieveTaskDirectory dumpdir If there are multiple failing tasks, you might need to replace <code>WDLTaskJob</code> with the name of one of the failing jobs. See [https://toil.readthedocs.io/en/latest/running/debugging.html#fetching-job-inputs the Toil documentation on retrieving files] for more on how to use this command. === Manually Finding Input Files === If you can't use <code>toil debug-job</code>, you might need to manually dig through the job store for files. 
In the log of your failing Toil task, look for lines like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk. So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ==More Ways of Finding Files== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name. 
For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ==Using Development Versions of Toil== Sometimes, bugs will be fixed in the development version of Toil, but not released yet. 
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ==Frequently Asked Questions== ===I am getting warnings about <code>XDG_RUNTIME_DIR</code>=== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> You should upgrade Toil. [https://github.com/DataBiosphere/toil/commit/ff6bf60ab798a675c20156c749817c4313644b96 Since Toil 6.1.0], Toil no longer issues this warning, and just puts up with bad <code>XDG_RUNTIME_DIR</code> settings. ===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!=== The Toil worker process for each job will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. 
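Relatedly, when you have a <code>toilfile:</code> URI from the log (as in the "More Ways of Finding Files" section above), you don't need a website to URL-decode it; Python's standard library can do it locally. A sketch using the example URI from that section:

```shell
# Decode a Toil file URI locally with Python's urllib, then take the part
# after the last colon to get the path relative to the job store.
uri='toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam'
decoded="$(python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.argv[1]))' "$uri")"
echo "$decoded"
echo "${decoded##*:}"  # strip everything through the last colon
```

Joining that last path onto your <code>--jobStore</code> directory gives you the file's location on disk.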
=Additional WDL resources= For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL documentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification] '''Tutorial: Getting Started with WDL Workflows on Phoenix''' Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong. =Phoenix Cluster Setup= Before we begin, you will need a computer to work at that you are able to install software on, and the ability to connect to other machines over SSH. ==Getting VPN access== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives.
So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ==Connecting to Phoenix== Once you have VPN access, you can connect to any of the machines with access to the Phoenix cluster. These interactive nodes are fairly large machines that can do some work locally, but you will still want to run larger workflows on the actual cluster. For this tutorial, we will use <code>emerald.prism</code> as our login node. To connect to the cluster: 1. Connect to the VPN. 2. SSH to <code>emerald.prism</code>. At the command line, run: ssh emerald.prism If your username on the cluster (say, <code>flastname</code>) is different from your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@emerald.prism The first time you connect, you will see a message like: The authenticity of host 'emerald.prism (10.50.1.67)' can't be established. ED25519 key fingerprint is SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI. This key is not known by any other names. Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>emerald.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen.
If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is. ==Installing Toil with WDL support== Once you are on the head node, you can install Toil, a program for running workflows. Toil is written in Python, and the modern way to install Python command line tools is with pipx. So [https://pipx.pypa.io/latest/installation/ install pipx]: python3 -m pip install --user pipx python3 -m pipx ensurepath This may instruct you to log out and log back in or take some other action to adopt the new <code>PATH</code> settings. When installing Toil, you need to specify that you want WDL support. To do this, you can run: pipx install 'toil[wdl]' If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will also need to install Toil with the <code>aws</code> and <code>google</code> extras, respectively: pipx install 'toil[wdl,aws,google]' To change what extras are used when you have an existing Toil installation, you will need to use the <code>--force</code> option. This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. The <code>python3 -m pipx ensurepath</code> command should have added the <code>~/.local/bin</code> directory to your <code>PATH</code> environment variable, to ensure you can find these commands. After that, '''log out and log back in''', to restart bash and pick up the change. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can run <code>pipx upgrade toil</code>, or repeat the <code>pipx install</code> command above with <code>--force</code>.
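If the <code>toil-wdl-runner --help</code> check fails with "command not found", the usual culprit is <code>PATH</code>. This sketch shows what <code>pipx ensurepath</code> is meant to arrange (it assumes the default <code>~/.local/bin</code> location described above):

```shell
# pipx installs console scripts into ~/.local/bin; that directory has to be
# on PATH for bare commands like toil-wdl-runner to resolve.
export PATH="$HOME/.local/bin:$PATH"
case ":$PATH:" in
  *":$HOME/.local/bin:"*) echo "~/.local/bin is on PATH" ;;
  *) echo "~/.local/bin is NOT on PATH" ;;
esac
```

If the directory turns out to be missing, re-run <code>python3 -m pipx ensurepath</code> and start a new login shell.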
==Configuring your Phoenix Environment== '''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under <code>/private/groups</code>. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under <code>/private/groups</code>. Usually you would end up with <code>/private/groups/YOURGROUPNAME/YOURUSERNAME</code>. Remember this path; we will need it later. ==Configuring Toil for Phoenix== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. But since these files can be large, and the home directory quota is only 30 GB, you might not be able to keep them in your home directory. We would like to be able to store these on the cluster's large storage array, under <code>/private/groups</code>. However, Toil needs to use file locks in these directories to prevent simultaneous Singularity calls from producing internal Singularity errors, and Ceph currently has [https://tracker.ceph.com/issues/65607 a bug where these file locking operations can freeze the Ceph servers]. If you have '''a small number of container images''' that will fit in your home directory, you can keep them there. [https://github.com/DataBiosphere/toil/commit/cb0b291bb7f6212bfe69221dd9f09d72f83e92fb Since Toil 6.1.0], this is the default behavior and you don't need to do anything. (Unless you previously set <code>SINGULARITY_CACHEDIR</code> or <code>MINIWDL__SINGULARITY__IMAGE_CACHE</code>, in which case you need to unset them.)
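To check whether those variables are lingering in your current shell, you can use a loop like this (a hypothetical helper, using the standard <code>printenv</code> command):

```shell
# Report whether the old cache variables are set; on Toil 6.1.0+ they
# should be unset so Toil can use its default home-directory cache.
for var in SINGULARITY_CACHEDIR MINIWDL__SINGULARITY__IMAGE_CACHE; do
  if printenv "$var" > /dev/null; then
    echo "$var is set; run: unset $var (and remove it from ~/.bashrc)"
  else
    echo "$var is unset"
  fi
done
```

Remember that <code>unset</code> only affects the current shell; if the variables come from <code>~/.bashrc</code>, you also need to delete those lines and log in again.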
'''If you don't have room in your home directory''' for container images, currently the recommended approach is to use node-local storage under <code>/data/tmp</code>. This results in each node pulling each container image, but images will be saved across workflows. You can set that up for all your workflows with: echo 'export SINGULARITY_CACHEDIR="/data/tmp/$(whoami)/cache/singularity"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="/data/tmp/$(whoami)/cache/miniwdl"' >>~/.bashrc Then '''log out and log back in again''', to apply the changes. =Running an existing workflow= First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, go to your user directory under <code>/private/groups</code>, and make a directory to work in. cd /private/groups/YOURGROUPNAME/YOURUSERNAME mkdir workflow-test cd workflow-test Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ==Preparing an input file== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. 
Let's make it in <code>names.txt</code>:

 echo "Mridula Resurrección" >names.txt
 echo "Gershom Šarlota" >>names.txt
 echo "Ritchie Ravi" >>names.txt

Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this:

 echo '{"hello_caller.who": "./names.txt"}' >inputs.json

Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].

==Testing at small scale (single-machine)==
We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:

 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i

This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command:

 toil-wdl-runner self_test.wdl inputs.json -o local_run

This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>.
This will print a lot of logging to standard error, and to standard output it will print:

 {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]}

The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>.

==Running at larger scale==
Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people! Go get this handy list of people and cut it to length:

 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt
 head -n100 1000_names.txt >100_names.txt

And make a new inputs file:

 echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json

Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create, where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the <code>--caching false</code> option. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it.
Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the <code>TOIL_SLURM_ARGS</code> environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.

 mkdir -p logs
 export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium"
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json

This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory.

=Writing your own workflow=
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz].

==Writing the file==
===Version===
All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement:

 version 1.0

===Workflow Block===
Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>.

 version 1.0
 workflow FizzBuzz {
 }

===Input Block===
Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section.

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
 }

Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.

===Body===
Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable.

 Array[Int] numbers = range(item_count)

WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.

===Scattering===
Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0.

 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
 }

===Conditionals===
Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results.
But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end, and take advantage of the fact that variables from un-executed conditionals are <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else, you will have to check the negated condition. So first, let's handle the special cases.

 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
     if (one_based % to_fizz == 0) {
         String fizz = "Fizz"
         if (one_based % to_buzz == 0) {
             String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
         }
     }
     if (one_based % to_buzz == 0) {
         String buzz = "Buzz"
     }
     if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
         # Just a normal number.
     }
 }

===Calling Tasks===
Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, but only if we don't make a noise instead.

 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
     if (one_based % to_fizz == 0) {
         String fizz = "Fizz"
         if (one_based % to_buzz == 0) {
             String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
         }
     }
     if (one_based % to_buzz == 0) {
         String buzz = "Buzz"
     }
     if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
         # Just a normal number.
         call stringify_number {
             input:
                 the_number = one_based
         }
     }
     String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
 }

We can put the code into the workflow now, and set about writing the task.

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }

===Writing Tasks===
Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>.

 task stringify_number {
 }

We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill that in in <code>input</code> and <code>output</code> sections.

 task stringify_number {
     input {
         Int the_number
     }
     # ???
     output {
         String the_string # = ???
     }
 }

Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.
 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string # = ???
     }
 }

Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
 }

We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage.
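As an aside, the <code>stdout()</code> plus <code>read_string()</code> pattern behaves much like Bash command substitution, which also strips trailing newlines when capturing a command's output. A rough Bash analogue of what the task computes (an illustration, not part of the workflow):

```shell
# Rough Bash analogue of the stringify_number task: capture echo's
# standard output and strip the trailing newline, like WDL's
# read_string(stdout()).
the_number=42
the_string="$(echo "$the_number")"
echo "captured: [$the_string]"
```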
Then we can put our task into our WDL file:

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

===Output Block===
Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task.

 version 1.0
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
     output {
         Array[String] fizzbuzz_results = result
     }
 }

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array.

==Running the Workflow==
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:

 echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json

Then run it on the cluster with Toil:

 mkdir -p logs
 export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium"
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

Or locally:

 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

=Debugging Workflows=
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.
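One low-tech idea to start with: compute what the workflow ''should'' have produced, and compare against what it actually produced. For the FizzBuzz workflow above (with the defaults <code>to_fizz = 3</code> and <code>to_buzz = 5</code> and no override), the expected strings can be generated with a plain shell sketch like this (not part of the workflow itself):

```shell
# Sketch: the expected FizzBuzz output for item_count numbers, using the
# workflow's defaults to_fizz=3 and to_buzz=5 and no override.
item_count="${1:-20}"
i=1
while [ "$i" -le "$item_count" ]; do
    if [ $((i % 3)) -eq 0 ] && [ $((i % 5)) -eq 0 ]; then
        echo "FizzBuzz"
    elif [ $((i % 3)) -eq 0 ]; then
        echo "Fizz"
    elif [ $((i % 5)) -eq 0 ]; then
        echo "Buzz"
    else
        echo "$i"
    fi
    i=$((i + 1))
done
```

If the strings in <code>fizzbuzz_out.json</code> differ from this, the workflow logic (or the inputs file) is the first place to look.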
==Restarting the Workflow==
If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with <code>--jobStore</code> set manually to a directory that persists between attempts, you can add <code>--restart</code> to your workflow command and make Toil try again. It will pick up from where it left off, rerun any failed tasks, and then run the rest of the workflow. This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart. If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques.

==Debugging Options==
When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place where you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:

 =========> Toil job log is here <=========

Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this.

==Reading the Log==
When a WDL workflow fails, you are likely to see a message like this:

 WDL.runtime.error.CommandFailed: task command failed with exit status 1
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism

This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen either when the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error.
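The "exit status 1" in that message is just the shell exit code of the task's command. A quick sketch of how exit codes surface in Bash (for illustration; <code>false</code> always exits 1, and <code>$?</code> holds the last command's status):

```shell
# Sketch: how nonzero exit codes like the "exit status 1" above arise.
false || echo "the command failed with exit status $?"
true && echo "the command succeeded with exit status $?"
```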
Go up higher in the log until you find lines that look like:

 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows:

and

 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows:

These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. If you would like individual task logs to be saved separately for later reference, you can use the <code>--writeLogs</code> option to specify a directory to store them. For more information, see [https://toil.readthedocs.io/en/latest/wdl/running.html#managing-workflow-logs the Toil documentation of workflow task logs].

==Reproducing Problems==
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this.

===Automatically Fetching Input Files===
The <code>toil debug-job</code> command has a <code>--retrieveTaskDirectory</code> option that lets you dump out a directory with all the files that a failing WDL task would use. You can use it like:

 toil debug-job ./jobstore WDLTaskJob --retrieveTaskDirectory dumpdir

If there are multiple failing tasks, you might need to replace <code>WDLTaskJob</code> with the name of one of the failing jobs. See [https://toil.readthedocs.io/en/latest/running/debugging.html#fetching-job-inputs the Toil documentation on retrieving files] for more on how to use this command.

===Manually Finding Input Files===
If you can't use <code>toil debug-job</code>, you might need to manually dig through the job store for files.
In the log of your failing Toil task, look for lines like this:

 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'
 ...

The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk. So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at:

 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam

==More Ways of Finding Files==
Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try to find the files by name.
For example, if you want to look at <code>Sample.bam</code>, you can look for it like this:

 find /path/to/the/jobstore -name "Sample.bam"

If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:

 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam

You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/ an online URL decoder], getting this:

 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam

Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path, relative to the job store, where this file can be found.

==Using Development Versions of Toil==
Sometimes, bugs will be fixed in the development version of Toil, but not released yet.
To try the current development version of Toil, you can install it like this:

 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'

If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do:

 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'

==Frequently Asked Questions==
===I am getting warnings about <code>XDG_RUNTIME_DIR</code>===
You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> You should upgrade Toil. [https://github.com/DataBiosphere/toil/commit/ff6bf60ab798a675c20156c749817c4313644b96 Since Toil 6.1.0], Toil no longer issues this warning, and just puts up with bad <code>XDG_RUNTIME_DIR</code> settings.

===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!===
The Toil worker process for each job will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.
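===How can I decode a <code>toilfile:</code> URI without a web tool?===
The <code>toilfile:</code> URIs described in "More Ways of Finding Files" can be URL-decoded locally with Python's standard <code>urllib.parse</code> module instead of a website. A sketch, using the example URI from that section:

```shell
# Sketch: URL-decode a toilfile: URI and pull out the job-store-relative
# path (the part after the last colon), without a web decoder.
uri='toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam'
decoded="$(python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.argv[1]))' "$uri")"
echo "decoded: $decoded"
# Strip everything up to and including the last colon to get the relative path.
echo "relative path: ${decoded##*:}"
```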
=Additional WDL resources=
For more information on writing and running WDL workflows, see:

* [https://docs.openwdl.org/en/stable/ The WDL documentation]
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube]
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]

'''Tutorial: Getting Started with WDL Workflows on Phoenix'''

Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.

=Phoenix Cluster Setup=
Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.

==Getting VPN access==
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives.
So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.

==Connecting to Phoenix==
Once you have VPN access, you can connect to any of the machines with access to the Phoenix cluster. These interactive nodes are fairly large machines that can do some work locally, but you will still want to run larger workflows on the actual cluster. For this tutorial, we will use <code>emerald.prism</code> as our login node. To connect to the cluster:

1. Connect to the VPN.

2. SSH to <code>emerald.prism</code>. At the command line, run:

 ssh emerald.prism

If your username on the cluster (say, <code>flastname</code>) is different from your username on your computer (which might be <code>firstname</code>), you might instead have to run:

 ssh flastname@emerald.prism

The first time you connect, you will see a message like:

 The authenticity of host 'emerald.prism (10.50.1.67)' can't be established.
 ED25519 key fingerprint is SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI.
 This key is not known by any other names.
 Are you sure you want to continue connecting (yes/no/[fingerprint])?

This is your computer asking you to help it decide if it is talking to the genuine <code>emerald.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen.
If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is.

==Installing Toil with WDL support==
Once you are on the head node, you can install Toil, a program for running workflows. Toil is written in Python, and the modern way to install Python command line tools is with pipx. So [https://pipx.pypa.io/latest/installation/ install pipx]:

 python3 -m pip install --user pipx
 python3 -m pipx ensurepath

This may instruct you to '''log out and log back in''' or take some other action to adopt the new <code>PATH</code> settings. When installing Toil, you need to specify that you want WDL support. To do this, you can run:

 pipx install 'toil[wdl]'

If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively:

 pipx install 'toil[wdl,aws,google]'

To change what extras are used when you have an existing Toil installation, you will need to use the <code>--force</code> option. This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. The <code>python3 -m pipx ensurepath</code> command should have added the <code>~/.local/bin</code> directory to your <code>PATH</code> environment variable, to ensure you can find these commands. If you see something from <code>pipx</code> like:

 - cwltoil (symlink missing or pointing to unexpected location)

then <code>pipx uninstall toil</code>, remove the offending file from <code>~/.local/bin</code>, and try again. To make sure it worked, you can run:

 toil-wdl-runner --help

If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports.
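If instead the command isn't found, the usual culprit is <code>PATH</code>. A quick sketch for checking whether a tool is visible on your <code>PATH</code>, and where it lives (substitute any command name):

```shell
# Sketch: check whether toil-wdl-runner is on the PATH, and where.
tool="toil-wdl-runner"
if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool found at: $(command -v "$tool")"
else
    echo "$tool not found; check that ~/.local/bin is in your PATH"
fi
```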
If you ever want to upgrade Toil to a new release, you can repeat the <code>pipx</code> command above.
'''If you don't have room in your home directory''' for container images, currently the recommended approach is to use node-local storage under <code>/data/tmp</code>. This results in each node pulling each container image, but images will be saved across workflows. You can set that up for all your workflows with: echo 'export SINGULARITY_CACHEDIR="/data/tmp/$(whoami)/cache/singularity"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="/data/tmp/$(whoami)/cache/miniwdl"' >>~/.bashrc Then '''log out and log back in again''', to apply the changes. =Running an existing workflow= First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, go to your user directory under <code>/private/groups</code>, and make a directory to work in. cd /private/groups/YOURGROUPNAME/YOURUSERNAME mkdir workflow-test cd workflow-test Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ==Preparing an input file== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. 
Let's make it in <code>names.txt</code>: echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ==Testing at small scale on a single machine== We are now ready to run the workflow! You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>.
This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ==Running at larger scale== Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people! Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the <code>--caching false</code> option. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it.
Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the <code>TOIL_SLURM_ARGS</code> environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in. mkdir -p logs export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium" toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. =Writing your own workflow= In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ==Writing the file== ===Version=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 ===Workflow Block=== Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } ===Input Block=== Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? 
fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run. ===Body=== Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. ===Scattering=== Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } ===Conditionals=== Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. 
But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } ===Calling Tasks=== Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access (but only for the numbers where we didn't make a noise instead). Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number.
call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } ===Writing Tasks=== Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. So let's fill those in with <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.
task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage.
Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } ===Output Block=== Now the only thing missing is a workflow-level <code>output</code> section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make?
Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. ==Running the Workflow== Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium" toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json =Debugging Workflows= Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.
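One simple idea for a workflow like the FizzBuzz example above is to compare its outputs against an independent reference implementation. Here is a minimal pure-Python sketch of the same logic (the function name and keyword defaults are just for illustration; they are not part of the workflow):

```python
def fizzbuzz(item_count, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    """Mirror the FizzBuzz WDL workflow: one string per 1-based number."""
    results = []
    for i in range(item_count):  # WDL's range() is also 0-based
        n = i + 1  # like the one_based declaration in the scatter
        if n % to_fizz == 0 and n % to_buzz == 0:
            # Equivalent of select_first([fizzbuzz_override, "FizzBuzz"])
            results.append(fizzbuzz_override or "FizzBuzz")
        elif n % to_fizz == 0:
            results.append("Fizz")
        elif n % to_buzz == 0:
            results.append("Buzz")
        else:
            # The stringify_number task's job: number -> string
            results.append(str(n))
    return results

print(fizzbuzz(20))
```

For an <code>item_count</code> of 20, this should agree with the <code>fizzbuzz_results</code> array that the workflow writes to <code>fizzbuzz_out.json</code>.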
==Restarting the Workflow== If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with <code>--jobStore</code> set manually to a directory that persists between attempts, you can add <code>--restart</code> to your workflow command and make Toil try again. It will pick up from where it left off and rerun any failed tasks and then the rest of the workflow. This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart. If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques. ==Debugging Options== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place where you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this. ==Reading the Log== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen either when the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error.
Go up higher in the log until you find lines that look like: [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows: And [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows: These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. If you would like individual task logs to be saved separately for later reference, you can use the <code>--writeLogs</code> option to specify a directory to store them. For more information, see [https://toil.readthedocs.io/en/latest/wdl/running.html#managing-workflow-logs the Toil documentation of workflow task logs]. ==Reproducing Problems== When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. === Automatically Fetching Input Files === The <code>toil debug-job</code> command has a <code>--retrieveTaskDirectory</code> option that lets you dump out a directory with all the files that a failing WDL task would use. You can use it like: toil debug-job ./jobstore WDLTaskJob --retrieveTaskDirectory dumpdir If there are multiple failing tasks, you might need to replace <code>WDLTaskJob</code> with the name of one of the failing jobs. See [https://toil.readthedocs.io/en/latest/running/debugging.html#fetching-job-inputs the Toil documentation on retrieving files] for more on how to use this command. === Manually Finding Input Files === If you can't use <code>toil debug-job</code>, you might need to manually dig through the job store for files. 
In the log of your failing Toil task, look for lines like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk. So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ==More Ways of Finding Files== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name. 
For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ==Using Development Versions of Toil== Sometimes, bugs will be fixed in the development version of Toil, but not released yet. 
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ==Frequently Asked Questions== ===I am getting warnings about <code>XDG_RUNTIME_DIR</code>=== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> You should upgrade Toil. [https://github.com/DataBiosphere/toil/commit/ff6bf60ab798a675c20156c749817c4313644b96 Since Toil 6.1.0], Toil no longer issues this warning, and just puts up with bad <code>XDG_RUNTIME_DIR</code> settings. ===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!=== The Toil worker process for each job will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. 
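Incidentally, the URL-decoding step in the ''More Ways of Finding Files'' section above doesn't require a website; Python's standard library can do it. A small sketch using the example <code>toilfile:</code> URI from that section:

```python
from urllib.parse import unquote

# The toilfile: URI as it appears (percent-encoded) in the debug log
uri = ("toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob"
       "%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351"
       "%2FSample.chr14.bam/Sample.chr14.bam")

# Undo the %XX escapes (%3A -> ':', %2F -> '/')
decoded = unquote(uri)

# The job-store-relative path is everything after the last colon
relative_path = decoded.rsplit(":", 1)[1]
print(relative_path)
```

Joining <code>relative_path</code> onto your <code>--jobStore</code> directory gives the on-disk location of the file, as described above.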
=Additional WDL resources= For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL documentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification] 9e9dea7fe5823103dae229ea3fe57a82541c98e2 Requirement for users to get GI VPN access 0 9 647 597 2025-02-20T20:47:12Z Weiler 3 wikitext text/x-wiki Before you are allowed access to our firewalled/secure area ("Prism"), you have to complete 3 items and provide the completed certificates or forms: '''1''': You must take and complete the NIH Public Security Refresher Course online. You must complete the course in a single continuous sitting: https://irtsectraining.nih.gov/public.aspx Click on the "Enter Public Training Portal" near the bottom of the page. The course is titled "2024 Information Security, Insider Threats, Privacy Awareness, Records Management and Emergency Preparedness Refresher". At the end you will be able to save the completion certificate that should have your name on it. '''2''': You need to sign the Genomics Institute VPN User Agreement (digital signature OK), located here for download: [[Media:GI_VPN_Policy.pdf]] '''3''': Please read and sign the last page of the NIH Genomic Data Sharing Policy agreement (digital signature OK), located here for download. By signing the document you agree that you have read and understand the policies described therein and that you agree to abide by those policies: [[Media:NIH_GDS_Policy.pdf]] When you have the three documents described above ready, please complete this form: https://app.smartsheet.com/b/form/a76dbd90ba0240ab9ea9d39b390586ce. There are two parts in this process. 1. For the user, please fill in ALL required fields '''and attach''' all three required documents described above.
The form then goes to your PI for approval - remind them to approve it, or it won't get sent to us for processing! 2. For the Sponsor/PI - you will receive an email from Smartsheets. Please fill in all required fields and submit. We will receive your completed request and we will create your account, then you will receive a welcome email with instructions on how to configure your VPN client and gain access to our systems. When using the VPN software off-campus, it will usually work unless the wireless network you are on has restrictions preventing it from functioning. Some other universities have such restrictions (notably UCSF), but most other wireless networks and home wireless networks should work fine. '''PLEASE NOTE:''' Because of the overhead required in setting up VPN access, please only request access if you have an immediate need to work on data that exists behind the firewall. We have had a decent number of people request access and go through the setup but then never use it. In other words, please do not request access because "one day you might need it", but because you '''do''' actually need it! '''ALSO NOTE:''' VPN accounts typically expire one year from the date of first gaining access. To renew for another year you will need your PI/sponsor to send us a note asking for renewal. c30fda3ee58c3d9fe856694f63f9b359a8e82bf5 Resetting your VPN/PRISM Password 0 60 648 554 2025-02-20T22:45:52Z Weiler 3 wikitext text/x-wiki If you have forgotten your VPN password (which is also your PRISM UNIX password), send an email to '''cluster-admin@soe.ucsc.edu''' requesting that your password be reset (include your username in the request). Once we have sent you your new temporary password, you will need to: 1: Log into the PRISM VPN using this new temporary password. 2: Log into one of the servers behind the firewall (mustard, emerald, crimson or razzmatazz) using your new temporary password.
3: Once you log in there, it should ask you to type in your temporary password one more time, then it will ask you to choose a new password. If it does not ask you to change your password (because you are logging in with SSH public keys), use the '''passwd''' command to change your password. Once you choose a new password (and type it twice for confirmation), log out of your SSH session. '''NOTE:''' Your new password must be 10 characters long, using three or more character classes (lowercase, uppercase, number or special character). 4: Log out (disconnect) from the VPN. '''This step is very important!''' 5: Log back into the VPN using your '''new''' password that you chose in step 3. 6: Log back into one of the servers (mustard, emerald, crimson or razzmatazz) using your new password. Assuming all that works, your password has been reset. You cannot reset your password to one of the prior five passwords you have used for your account. 1cadb808490eaf5ff257f1cab142eabaed9fbd0b Duo Pushes Aren't Being Sent to My Phone! 0 73 654 2025-03-15T15:31:17Z Weiler 3 wikitext text/x-wiki If you are having an issue such that you are trying to login to the Genomics Institute MFA VPN service and you are typing your username and password correctly, but you aren't receiving a Duo Push on your phone (and then the login times out), follow these steps to troubleshoot it. First, did you enroll your phone in Duo MFA when you set up your CruzID?
If not, follow these instructions to get started: https://its.ucsc.edu/mfa/enroll.html

If you have already done that and you have successfully received Duo Pushes in the past, then follow these steps to debug it:

# If you are on Wifi, try disabling Wifi and just use your phone’s cellular connection. Then try logging in again.
# Make sure that notifications are enabled in the Duo App. Sometimes they weirdly “disable”, and the pushes don’t come in.
# Make sure your phone isn’t in “Do Not Disturb” or “Focus” mode. Some folks have Focus/Do Not Disturb turn on at a certain time of night, which can cause Duo to stop working. If that is the case for you, it may work during the daytime but not in the evening.
# Reboot your phone! You never know.
# Double-check that the time and date are correct on your phone. If they aren’t, Duo stops working.
# We’re getting to the bottom of the barrel here. Try “pulling down” on the Duo App screen to see if it refreshes any pending notifications.
# If none of that works, then you may have to re-initialize Duo altogether on your phone, which we can help with.
=Converting From Non-MFA VPN to the MFA-Enabled VPN on Windows=

If you are using OpenVPN Connect on Windows 10 or 11 to connect to the GI VPN, and you are looking to convert to the new MFA-enabled GI VPN, you have come to the right place.

You must already have Duo set up with your CruzID (which most of you do). If for some reason you don't have Duo set up yet on your phone, go here to enroll a device and configure Push Notifications with Duo before continuing: https://its.ucsc.edu/mfa/enroll.html

OK! Let's get to it. Disconnect from the VPN if you are already connected. Then you will need to download the new OpenVPN config file from here: https://giwiki.gi.ucsc.edu/downloads/prism-duo.ovpn

The credentials to access that website are username: '''genecats''' and password: '''KiloKluster'''

Download that file by right-clicking on the link above and selecting "Save Link As...", and save it to your Desktop or some other location you will remember.

Launch the '''OpenVPN GUI''' app (usually there is an icon for it on your Desktop, but you can search for it if not). It will launch and appear in your system tray on the bottom right (the system tray icon looks a bit like a '''^''' icon). You should see the OpenVPN icon there; it looks like a little computer screen with a lock on it.

Right-click on the OpenVPN icon in the system tray, and a small menu will appear. Select "Import file". In the resulting window, browse to your Desktop or wherever you saved the '''prism-duo.ovpn''' file. Select that file and click "Open".
Once you import the file, you should be able to right-click on OpenVPN Connect again in the system tray and select the profile you want to connect to. It should show multiple profiles: your old one and your new one. Select the new one, then select "Connect".

That's it! It will ask you for the usual GI PRISM username and password that you use to connect to our VPN, and after that it will send a Duo Push notification to your phone, and then you should be logged in.

Other than the Duo Push, the VPN behaves exactly like it did before. If you need to use an authentication method other than Duo Push, you can append a comma, and then the name of the method (like "push", "sms", or "phone"), or a numeric second factor code, to your password when you submit it.

If you have issues you can always revert back to the old configuration, which will still work for a while. We will disable the old VPN soon though, so make every effort to get the new VPN setup working. As always, please email '''cluster-admin@soe.ucsc.edu''' if you need help or have any questions.
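To make the method-suffix trick above concrete, here is an illustration. Supposing your PRISM password were <code>CorrectHorse99!</code> (a made-up value for this example), the VPN password field could be filled in as any of the following:

```
CorrectHorse99!           # no suffix: the default, sends a Duo Push
CorrectHorse99!,push      # explicitly request a Duo Push
CorrectHorse99!,phone     # request the "phone" (phone call) method
CorrectHorse99!,sms       # request the "sms" method
CorrectHorse99!,123456    # supply the numeric second factor code 123456
```

Your username is entered normally; only the password field carries the optional suffix.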
=Converting From Non-MFA VPN to the MFA-Enabled VPN on MacOS=

If you are using Tunnelblick on MacOS and you are looking to convert to the new MFA-enabled GI VPN, you have come to the right place.

You must already have Duo set up with your CruzID (which most of you do). If for some reason you don't have Duo set up yet on your phone, go here to enroll a device and configure Push Notifications with Duo before continuing: https://its.ucsc.edu/mfa/enroll.html

OK! Let's get to it. Disconnect from the VPN if you are already connected. Then you will need to download the new OpenVPN config file from here: https://giwiki.gi.ucsc.edu/downloads/

The credentials to access that website are username: '''genecats''' and password: '''KiloKluster'''

Go to the link above, right-click on '''prism-duo.ovpn''', select "Save Link As...", and save it to your Desktop or some other location you will remember.

Then open Tunnelblick and click on the Tunnelblick icon on the top right of your screen next to the date (it looks a bit like a small tunnel). In the window that opens, select "VPN Details...". In the resulting window, select the "Configurations" tab on the top. You will see a list of Configurations on the left, and it should include the current configuration you use to connect; it may be called 'prism' or maybe 'client'.

Drag the new configuration called '''prism-duo.ovpn''' (from your Desktop) into the Configurations area beneath your old configuration. It should import the configuration. It will ask you if you want to install it for "Only You" or "All Users". Click "Only You". You will also be asked to type in your laptop password.

That's it! Select the new configuration on the left and click the "Connect" button on the bottom right.
It will ask you for the usual GI PRISM username and password that you use to connect to our VPN, and after that it will send a Duo Push notification to your phone, and then you should be logged in.

Other than the Duo Push, the VPN behaves exactly like it did before. If you need to use an authentication method other than Duo Push, you can append a comma, and then the name of the method (like "push", "sms", or "phone"), or a numeric second factor code, to your password when you submit it.

If you have issues you can always revert back to the old configuration, which will still work for a while. We will disable the old VPN soon though, so make every effort to get the new VPN setup working.

Once you have the new VPN working, feel free to delete the old profile from Tunnelblick by clicking on the old profile in the "Configurations" window, then clicking the '''"-"''' button below to remove it.

As always, please email '''cluster-admin@soe.ucsc.edu''' if you need help or have any questions.
=Converting From Non-MFA VPN to the MFA-Enabled VPN on Linux=

If you are using OpenVPN on Linux to connect to the GI VPN and you are looking to convert to the new MFA-enabled GI VPN, you have come to the right place.

You must already have Duo set up with your CruzID (which most of you do). If for some reason you don't have Duo set up yet on your phone, go here to enroll a device and configure Push Notifications with Duo before continuing: https://its.ucsc.edu/mfa/enroll.html

OK! Let's get to it. Disconnect from the VPN if you are already connected.

The various flavors and versions of Linux differ in the specifics, so you may not be able to follow these exact instructions. This is based on the Network Manager in Ubuntu, but most Ubuntu/Debian variants will be similar.

You will need to download the new OpenVPN config file from here: https://giwiki.gi.ucsc.edu/downloads/prism-duo.ovpn

The credentials to access that website are username: '''genecats''' and password: '''KiloKluster'''

Download that file by right-clicking on the link above and selecting "Save Link As...", and save it to your Desktop or some other easy-to-remember location.

We will be installing the Prism VPN profile via the Network Manager GUI interface. Open '''Network Manager''' from the '''Gnome Settings''' option, select the '''Network''' tab, and click on the '''VPN +''' symbol:

[[File:Configuring_1.png|600px]]

From the '''Add VPN''' window, click on the '''Import from file...''' option:

[[File:Configuring_2.png|600px]]

Navigate to your .ovpn file (/path/to/your/prism-duo.ovpn) and click on the '''Open''' button:

[[File:Configuring_3.png|600px]]

Click on the '''Add''' button:

[[File:Configuring_4.png|600px]]

Finally, click the '''On/Off''' button to start the new VPN:

[[File:Configuring_5.png|600px]]

That's it!
It will ask you for the usual GI PRISM username and password that you use to connect to our VPN, and after that it will send a Duo Push notification to your phone, and then you should be logged in.

Other than the Duo Push, the VPN behaves exactly like it did before. If you need to use an authentication method other than Duo Push, you can append a comma, and then the name of the method (like "push", "sms", or "phone"), or a second factor code, to your password when you submit it.

If you have issues you can always revert back to the old configuration, which will still work for a while. We will disable the old VPN soon though, so make every effort to get the new VPN setup working. Once you have the new VPN working, feel free to delete the old profile from the Network Manager.

As always, please email '''cluster-admin@soe.ucsc.edu''' if you need help or have any questions.

=Phoenix WDL Tutorial=

'''Tutorial: Getting Started with WDL Workflows on Phoenix'''

Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.

This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments.
By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.

=Phoenix Cluster Setup=

Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.

==Getting VPN access==

We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.

To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.

==Connecting to Phoenix==

Once you have VPN access, you can connect to any of the machines with access to the Phoenix cluster. These interactive nodes are fairly large machines that can do some work locally, but you will still want to run larger workflows on the actual cluster. For this tutorial, we will use <code>emerald.prism</code> as our login node.

To connect to the cluster:

1. Connect to the VPN.

2. SSH to <code>emerald.prism</code>. At the command line, run:

 ssh emerald.prism

If your username on the cluster (say, <code>flastname</code>) is different than your username on your computer (which might be <code>firstname</code>), you might instead have to run:

 ssh flastname@emerald.prism

The first time you connect, you will see a message like:

 The authenticity of host 'emerald.prism (10.50.1.67)' can't be established.
 ED25519 key fingerprint is SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI.
 This key is not known by any other names.
 Are you sure you want to continue connecting (yes/no/[fingerprint])?

This is your computer asking you to help it decide if it is talking to the genuine <code>emerald.prism</code>, and not an imposter. You will want to make sure that the "key fingerprint" is indeed <code>SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is.

==Installing Toil with WDL support==

Once you are on the head node, you can install Toil, a program for running workflows. Toil is written in Python, and the modern way to install Python command line tools is with pipx. So [https://pipx.pypa.io/latest/installation/ install pipx]:

 python3 -m pip install --user pipx
 python3 -m pipx ensurepath

This may instruct you to '''log out and log back in''' or take some other action to adopt the new <code>PATH</code> settings.

When installing Toil, you need to specify that you want WDL support. To do this, you can run:

 pipx install 'toil[wdl]'

If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to install Toil with the <code>aws</code> and <code>google</code> extras, respectively:

 pipx install 'toil[wdl,aws,google]'

To change which extras are used when you have an existing Toil installation, you will need to use the <code>--force</code> option.

This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>.
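If you want to double-check that <code>~/.local/bin</code> actually made it onto your <code>PATH</code>, a quick shell check like the following works (a sketch; the directory name assumes pipx's default layout):

```shell
# Report whether a given directory is on PATH; pipx puts toil-wdl-runner
# in ~/.local/bin, so that is the directory worth checking.
on_path() {
    case ":$PATH:" in
        *":$1:"*) echo "on PATH: $1" ;;
        *) echo "NOT on PATH: $1" ;;
    esac
}
on_path "$HOME/.local/bin"
```

If it reports that the directory is not on your <code>PATH</code>, re-run <code>python3 -m pipx ensurepath</code> and log out and back in.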
The <code>python3 -m pipx ensurepath</code> command should have added the <code>~/.local</code> directory to your <code>PATH</code> environment variable, to ensure you can find these commands. If you see something from <code>pipx</code> like:

 - cwltoil (symlink missing or pointing to unexpected location)

then <code>pipx uninstall toil</code>, remove the offending file from <code>~/.local/bin</code>, and try again.

To make sure it worked, you can run:

 toil-wdl-runner --help

If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports.

If you ever want to upgrade Toil to a new release, you can repeat the <code>pipx</code> command above.

==Configuring your Phoenix Environment==

'''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under <code>/private/groups</code>.

You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under <code>/private/groups</code>. Usually you would end up with <code>/private/groups/YOURGROUPNAME/YOURUSERNAME</code>. Remember this path; we will need it later.

==Configuring Toil for Phoenix==

Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, so we should configure Toil to use it for caching the Docker container images used for running workflow steps. However, since these files can be large, and the home directory quota is only 30 GB, we might not be able to keep them in your home directory. We would like to be able to store these on the cluster's large storage array, under <code>/private/groups</code>.
However, Toil needs to use file locks in these directories to prevent simultaneous Singularity calls from producing internal Singularity errors, and Ceph currently has [https://tracker.ceph.com/issues/65607 a bug where these file locking operations can freeze the Ceph servers].

If you have '''a small number of container images''' that will fit in your home directory, you can keep them there. [https://github.com/DataBiosphere/toil/commit/cb0b291bb7f6212bfe69221dd9f09d72f83e92fb Since Toil 6.1.0], this is the default behavior and you don't need to do anything. (Unless you previously set <code>SINGULARITY_CACHEDIR</code> or <code>MINIWDL__SINGULARITY__IMAGE_CACHE</code>, in which case you need to unset them.)

'''If you don't have room in your home directory''' for container images, currently the recommended approach is to use node-local storage under <code>/data/tmp</code>. This results in each node pulling each container image, but images will be saved across workflows. You can set that up for all your workflows with:

 echo 'export SINGULARITY_CACHEDIR="/data/tmp/$(whoami)/cache/singularity"' >>~/.bashrc
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="/data/tmp/$(whoami)/cache/miniwdl"' >>~/.bashrc

Then '''log out and log back in again''' to apply the changes.

=Running an existing workflow=

First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].

First, go to your user directory under <code>/private/groups</code>, and make a directory to work in:

 cd /private/groups/YOURGROUPNAME/YOURUSERNAME
 mkdir workflow-test
 cd workflow-test

Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl

==Preparing an input file==

Near the top of the WDL file, there's a section like this:

 workflow hello_caller {
     input {
         File who
     }

This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.

So first, we have to make that list of names. Let's make it in <code>names.txt</code>:

 echo "Mridula Resurrección" >names.txt
 echo "Gershom Šarlota" >>names.txt
 echo "Ritchie Ravi" >>names.txt

Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this:

 echo '{"hello_caller.who": "./names.txt"}' >inputs.json

Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].

==Testing at small scale on a single machine==

We are now ready to run the workflow! You don't want to run workflows on the head node.
So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:

 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i

This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use <code>exit</code>.

In your new shell, run this Toil command:

 toil-wdl-runner self_test.wdl inputs.json -o local_run

This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print:

 {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]}

The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person.

To leave your interactive Slurm session and return to the head node, use <code>exit</code>.

==Running at larger scale==

Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people! Go get this handy list of people and cut it to length:

 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt
 head -n100 1000_names.txt >100_names.txt

And make a new inputs file:

 echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json

Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.
To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create, where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the <code>--caching false</code> option. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it.

Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the <code>TOIL_SLURM_ARGS</code> environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.

 mkdir -p logs
 export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium"
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json

This will tick along for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory.

=Writing your own workflow=

In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz].

==Writing the file==

===Version===

All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow.
Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 ===Workflow Block=== Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } ===Input Block=== Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run. ===Body=== Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. ===Scattering=== Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. 
Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } ===Conditionals=== Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } ===Calling Tasks=== Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. 
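Before wiring in that call, here is the per-number logic we are building, sketched as a plain shell script (an illustration only, not part of the workflow; it hardcodes the default divisors and ignores the override). Empty strings stand in for WDL <code>null</code>:

```shell
# select_first analog: print the first non-empty argument
select_first() {
  for v in "$@"; do
    if [ -n "$v" ]; then echo "$v"; return 0; fi
  done
  return 1
}

for i in $(seq 1 15); do
  fizz=""; buzz=""; fizzbuzz=""
  if [ $((i % 3)) -eq 0 ]; then fizz="Fizz"; fi
  if [ $((i % 5)) -eq 0 ]; then buzz="Buzz"; fi
  # Only set when both "conditionals" fired, like the nested if in the WDL
  if [ -n "$fizz" ] && [ -n "$buzz" ]; then fizzbuzz="FizzBuzz"; fi
  # Like the WDL, take the first non-null candidate
  select_first "$fizzbuzz" "$fizz" "$buzz" "$i"
done
```

Running this prints the classic FizzBuzz sequence for 1 through 15; the WDL version computes the same thing, but with each number handled as a separate, parallel scatter iteration.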
To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access; since the call sits inside a conditional, its output will be <code>null</code> on the iterations where we made a noise instead. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } ===Writing Tasks=== Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>.
So let's fill that in in <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.
task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } ===Output Block=== Now the only thing missing is a workflow-level <code>output</code> section.
Technically, WDL 1.0 isn't supposed to require this section, but WDL 1.1 does require it, and Toil doesn't actually deliver your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. ==Running the Workflow== Now all that remains is to run the workflow!
As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium" toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json =Debugging Workflows= Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong. ==Restarting the Workflow== If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with <code>--jobStore</code> set manually to a directory that persists between attempts, you can add <code>--restart</code> to your workflow command and make Toil try again. It will pick up from where it left off and rerun any failed tasks and then the rest of the workflow. This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart. If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques. ==Debugging Options== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place you can access. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this.
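Those markers also make job logs easy to pull out mechanically. As a toy demonstration (the log below is fabricated; in practice <code>workflow.log</code> would be wherever you redirected <code>toil-wdl-runner</code>'s standard error), <code>sed</code> can print just the text between a pair of markers:

```shell
# Fabricated example log; a real one would come from something like:
#   toil-wdl-runner --logDebug ... 2>workflow.log
cat >workflow.log <<'EOF'
[2023-07-16T16:23:50-0700] [MainThread] [I] [toil.leader] running
=========> Toil job log is here <=========
task stderr line 1
task stderr line 2
=========> Toil job log is here <=========
[2023-07-16T16:23:54-0700] [MainThread] [I] [toil.leader] done
EOF

# Print only the lines between the markers (the job's own log)
sed -n '/=========>/,/=========>/{/=========>/d;p;}' workflow.log
```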
==Reading the Log== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen when either the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error. Go up higher in the log until you find lines that look like: [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows: And [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows: These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. If you would like individual task logs to be saved separately for later reference, you can use the <code>--writeLogs</code> option to specify a directory to store them. For more information, see [https://toil.readthedocs.io/en/latest/wdl/running.html#managing-workflow-logs the Toil documentation of workflow task logs]. ==Reproducing Problems== When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. === Automatically Fetching Input Files === The <code>toil debug-job</code> command has a <code>--retrieveTaskDirectory</code> option that lets you dump out a directory with all the files that a failing WDL task would use. 
You can use it like: toil debug-job ./jobstore WDLTaskJob --retrieveTaskDirectory dumpdir If there are multiple failing tasks, you might need to replace <code>WDLTaskJob</code> with the name of one of the failing jobs. See [https://toil.readthedocs.io/en/latest/running/debugging.html#fetching-job-inputs the Toil documentation on retrieving files] for more on how to use this command. === Manually Finding Input Files === If you can't use <code>toil debug-job</code>, you might need to manually dig through the job store for files. In the log of your failing Toil task, look for lines like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk. So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ==More Ways of Finding Files== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name. 
For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ==Using Development Versions of Toil== Sometimes, bugs will be fixed in the development version of Toil, but not released yet. 
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ==Frequently Asked Questions== ===I am getting warnings about <code>XDG_RUNTIME_DIR</code>=== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> You should upgrade Toil. [https://github.com/DataBiosphere/toil/commit/ff6bf60ab798a675c20156c749817c4313644b96 Since Toil 6.1.0], Toil no longer issues this warning, and just puts up with bad <code>XDG_RUNTIME_DIR</code> settings. ===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!=== The Toil worker process for each job will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. ===How do I delete files in WDL?=== WDL doesn't have a built-in way to delete files; if you run a task that deletes a file, it will still exist in Toil's job store storage. Toil [https://github.com/DataBiosphere/toil/commit/2de6eea2cc2e688b53062a98687445f0cca56669 recently gained support] for deleting files at the '''end''' of WDL workflows.
So if you have a large file that you only need for part of your workflow, consider writing that part as a separate sub-<code>workflow</code> and invoking it with <code>call</code>. Then the file will be cleaned up when the child workflow ends, leaving more space for files created in the parent workflow. =Additional WDL resources= For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL documentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification] '''Tutorial: Getting Started with WDL Workflows on Phoenix''' Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.
=Phoenix Cluster Setup= Before we begin, you will need a computer to work at that you can install software on, and the ability to connect to other machines over SSH. ==Getting VPN access== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ==Connecting to Phoenix== Once you have VPN access, you can connect to any of the machines with access to the Phoenix cluster. These interactive nodes are fairly large machines that can do some work locally, but you will still want to run larger workflows on the actual cluster. For this tutorial, we will use <code>emerald.prism</code> as our login node. To connect to the cluster: 1. Connect to the VPN. 2. SSH to <code>emerald.prism</code>. At the command line, run: ssh emerald.prism If your username on the cluster (say, <code>flastname</code>) is different than your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@emerald.prism The first time you connect, you will see a message like: The authenticity of host 'emerald.prism (10.50.1.67)' can't be established. ED25519 key fingerprint is SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI. This key is not known by any other names. Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>emerald.prism</code>, and not an imposter.
You will want to make sure that the "key fingerprint" is indeed <code>SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is. ==Installing Toil with WDL support== Once you are on the head node, you can install Toil, a program for running workflows. Toil is written in Python, and the modern way to install Python command line tools is with pipx. So [https://pipx.pypa.io/latest/installation/ install pipx]: python3 -m pip install --user pipx python3 -m pipx ensurepath This may instruct you to '''log out and log back in''' or take some other action to adopt the new <code>PATH</code> settings. When installing Toil, you need to specify that you want WDL support. To do this, you can run: pipx install 'toil[wdl]' If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively: pipx install 'toil[wdl,aws,google]' To change what extras are used when you have an existing Toil installation, you will need to use the <code>--force</code> option. This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. The <code>python3 -m pipx ensurepath</code> command should have added the <code>~/.local/bin</code> directory to your <code>PATH</code> environment variable, to ensure you can find these commands.
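You can double-check that the <code>PATH</code> change took effect in your current shell (an optional check; <code>~/.local/bin</code> is pipx's usual default, though your setup may differ):

```shell
# Print PATH entries one per line and look for pipx's bin directory;
# if the fallback message prints instead, log out and back in
# (or re-run: python3 -m pipx ensurepath).
echo "$PATH" | tr ':' '\n' | grep -Fx "$HOME/.local/bin" \
  || echo "~/.local/bin is not on your PATH yet"
```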
If you see something from <code>pipx</code> like: - cwltoil (symlink missing or pointing to unexpected location) then run <code>pipx uninstall toil</code>, remove the offending file from <code>~/.local/bin</code>, and try again. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can run <code>pipx upgrade toil</code>, or repeat the <code>pipx install</code> command above with the <code>--force</code> option. ==Configuring your Phoenix Environment== '''Do not try and store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under <code>/private/groups</code>. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under <code>/private/groups</code>. Usually you would end up with <code>/private/groups/YOURGROUPNAME/YOURUSERNAME</code>. Remember this path; we will need it later. ==Configuring Toil for Phoenix== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. However, since these files can be large, and the home directory quota is only 30 GB, we might not be able to keep these in your home directory. We would like to be able to store these on the cluster's large storage array, under <code>/private/groups</code>. However, Toil needs to use file locks in these directories to prevent simultaneous Singularity calls from producing internal Singularity errors, and Ceph currently has [https://tracker.ceph.com/issues/65607 a bug where these file locking operations can freeze the Ceph servers].
If you have '''a small number of container images''' that will fit in your home directory, you can keep them there. [https://github.com/DataBiosphere/toil/commit/cb0b291bb7f6212bfe69221dd9f09d72f83e92fb Since Toil 6.1.0], this is the default behavior and you don't need to do anything. (Unless you previously set <code>SINGULARITY_CACHEDIR</code> or <code>MINIWDL__SINGULARITY__IMAGE_CACHE</code>, in which case you need to unset them.) '''If you don't have room in your home directory''' for container images, currently the recommended approach is to use node-local storage under <code>/data/tmp</code>. This results in each node pulling each container image, but images will be saved across workflows. You can set that up for all your workflows with: echo 'export SINGULARITY_CACHEDIR="/data/tmp/$(whoami)/cache/singularity"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="/data/tmp/$(whoami)/cache/miniwdl"' >>~/.bashrc Then '''log out and log back in again''', to apply the changes. =Running an existing workflow= First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, go to your user directory under <code>/private/groups</code>, and make a directory to work in. cd /private/groups/YOURGROUPNAME/YOURUSERNAME mkdir workflow-test cd workflow-test Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. 
wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ==Preparing an input file== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code>: echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ==Testing at small scale on a single machine== We are now ready to run the workflow! You don't want to run workflows on the head node.
Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } ===Conditionals=== Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } ===Calling Tasks=== Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. 
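This scatter-and-conditionals pattern can be hard to follow the first time. Here is the same selection logic sketched in ordinary Python, with <code>select_first()</code> modeled as picking the first non-<code>None</code> value and un-executed conditionals leaving their variables <code>None</code>. This is only a mental model of the WDL semantics, not how Toil actually executes the workflow.

```python
def select_first(values):
    """Model of WDL select_first(): return the first non-null (non-None) value."""
    for value in values:
        if value is not None:
            return value
    raise ValueError("all values were null")

def fizzbuzz_word(one_based, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    # Variables declared in un-executed WDL conditionals are null;
    # we model that by leaving them as None.
    fizz = "Fizz" if one_based % to_fizz == 0 else None
    fizzbuzz = None
    if one_based % to_fizz == 0 and one_based % to_buzz == 0:
        fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
    buzz = "Buzz" if one_based % to_buzz == 0 else None
    # Stand-in for the stringify_number task we are about to write:
    number = str(one_based) if fizz is None and buzz is None else None
    return select_first([fizzbuzz, fizz, buzz, number])

# The scatter over range(item_count), with the one_based = i + 1 shift:
results = [fizzbuzz_word(i + 1) for i in range(15)]
```

Note how exactly one of <code>fizzbuzz</code>, <code>fizz</code>-without-<code>buzz</code>, <code>buzz</code>, or the plain number "wins" for each value, which is what lets the single <code>select_first()</code> at the end work.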
To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, as long as we actually made the call instead of producing a noise. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } ===Writing Tasks=== Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>.
So let's fill those in with <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.
task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } ===Output Block=== Now the only thing missing is a workflow-level <code>output</code> section.
Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. ==Running the Workflow== Now all that remains is to run the workflow!
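Before running it, it helps to know what output to expect. Here is a plain-Python reference implementation of the same FizzBuzz rules, useful only for checking the workflow's results by eye; nothing in the workflow itself uses it.

```python
def expected_fizzbuzz(item_count, to_fizz=3, to_buzz=5):
    """What FizzBuzz.fizzbuzz_results should contain, per the rules above."""
    out = []
    for i in range(1, item_count + 1):
        if i % to_fizz == 0 and i % to_buzz == 0:
            out.append("FizzBuzz")
        elif i % to_fizz == 0:
            out.append("Fizz")
        elif i % to_buzz == 0:
            out.append("Buzz")
        else:
            out.append(str(i))
    return out
```

For example, with <code>item_count</code> set to 20, entries 13 through 15 should be <code>"13"</code>, <code>"14"</code>, and <code>"FizzBuzz"</code>.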
As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium" toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json =Debugging Workflows= Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong. ==Restarting the Workflow== If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with <code>--jobStore</code> set manually to a directory that persists between attempts, you can add <code>--restart</code> to your workflow command and make Toil try again. It will pick up from where it left off, rerun any failed tasks, and then run the rest of the workflow. This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart. If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques. ==Debugging Options== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place where you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this.
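If you end up digging through a very long debug log, a small script can pull out just the per-job sections. This sketch assumes the job logs sit between pairs of <code>=========> ... <=========</code> marker lines as shown above; the exact marker text in your log may differ.

```python
def extract_job_logs(log_text):
    """Collect the text between paired '=========> ... <=========' lines.

    A sketch under the assumption that each job log is bracketed by two
    such marker lines; returns one string per bracketed section.
    """
    chunks = []
    current = None  # None means we are outside a marker pair
    for line in log_text.splitlines():
        if line.startswith("=========>") and line.endswith("<========="):
            if current is None:
                current = []                       # opening marker
            else:
                chunks.append("\n".join(current))  # closing marker
                current = None
        elif current is not None:
            current.append(line)
    return chunks
```

You could run this over a log saved with shell redirection (for example <code>2>run.log</code>) and print each chunk separately.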
==Reading the Log== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen either when the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error. Go up higher in the log until you find lines that look like: [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows: And [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows: These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. If you would like individual task logs to be saved separately for later reference, you can use the <code>--writeLogs</code> option to specify a directory to store them. For more information, see [https://toil.readthedocs.io/en/latest/wdl/running.html#managing-workflow-logs the Toil documentation of workflow task logs]. ==Reproducing Problems== When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. === Automatically Fetching Input Files === The <code>toil debug-job</code> command has a <code>--retrieveTaskDirectory</code> option that lets you dump out a directory with all the files that a failing WDL task would use.
You can use it like: toil debug-job ./jobstore WDLTaskJob --retrieveTaskDirectory dumpdir If there are multiple failing tasks, you might need to replace <code>WDLTaskJob</code> with the name of one of the failing jobs. See [https://toil.readthedocs.io/en/latest/running/debugging.html#fetching-job-inputs the Toil documentation on retrieving files] for more on how to use this command. === Manually Finding Input Files === If you can't use <code>toil debug-job</code>, you might need to manually dig through the job store for files. In the log of your failing Toil task, look for lines like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk. So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ==More Ways of Finding Files== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name. 
For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ==Using Development Versions of Toil== Sometimes, bugs will be fixed in the development version of Toil, but not released yet. 
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ==Frequently Asked Questions== ===I am getting warnings about <code>XDG_RUNTIME_DIR</code>=== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> You should upgrade Toil. [https://github.com/DataBiosphere/toil/commit/ff6bf60ab798a675c20156c749817c4313644b96 Since Toil 6.1.0], Toil no longer issues this warning, and just puts up with bad <code>XDG_RUNTIME_DIR</code> settings. ===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!=== The Toil worker process for each job will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. ===How do I delete files in WDL?=== WDL doesn't have a built-in way to delete files; if you run a task that deletes a file, it will still exist in Toil's job store storage. Toil [https://github.com/DataBiosphere/toil/commit/2de6eea2cc2e688b53062a98687445f0cca56669 recently gained support] for deleting files at the 'end' of WDL workflows. 
So if you have a large file that you only need for part of your workflow, consider writing that part as a separate sub-<code>workflow</code> and invoking it with <code>call</code>. Then the file will be cleaned up when the child workflow ends, leaving more space for files created in the parent workflow. =Additional WDL resources= For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL documentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification] '''Tutorial: Getting Started with WDL Workflows on Phoenix''' Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.
=Phoenix Cluster Setup= Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH. ==Getting VPN access== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ==Connecting to Phoenix== Once you have VPN access, you can connect to any of the machines with access to the Phoenix cluster. These interactive nodes are fairly large machines that can do some work locally, but you will still want to run larger workflows on the actual cluster. For this tutorial, we will use <code>emerald.prism</code> as our login node. To connect to the cluster: 1. Connect to the VPN. 2. SSH to <code>emerald.prism</code>. At the command line, run: ssh emerald.prism If your username on the cluster (say, <code>flastname</code>) is different than your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@emerald.prism The first time you connect, you will see a message like: The authenticity of host 'emerald.prism (10.50.1.67)' can't be established. ED25519 key fingerprint is SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI. This key is not known by any other names. Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>emerald.prism</code>, and not an imposter. 
You will want to make sure that the "key fingerprint" is indeed <code>SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is. ==Installing Toil with WDL support== Once you are on the head node, you can install Toil, a program for running workflows. Toil is written in Python, and the modern way to install Python command line tools is with pipx. So [https://pipx.pypa.io/latest/installation/ install pipx]: python3 -m pip install --user pipx python3 -m pipx ensurepath This may instruct you to '''log out and log back in''' or take some other action to adopt the new <code>PATH</code> settings. When installing Toil, you need to specify that you want WDL support. To do this, you can run: pipx install 'toil[wdl]' If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will also need to install Toil with the <code>aws</code> and <code>google</code> extras, respectively: pipx install 'toil[wdl,aws,google]' To change what extras are used when you have an existing Toil installation, you will need to use the <code>--force</code> option. This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. The <code>python3 -m pipx ensurepath</code> command should have added the <code>~/.local/bin</code> directory to your <code>PATH</code> environment variable, to ensure you can find these commands.
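A quick way to confirm that worked is to check your <code>PATH</code> directly. This is a generic shell sketch, not a pipx feature:

```shell
# Does a colon-separated PATH contain a given directory?
on_path() {
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Typical use after `python3 -m pipx ensurepath`:
if on_path "$HOME/.local/bin"; then
  echo "~/.local/bin is on PATH"
else
  echo "not on PATH yet; log out and back in, or re-run pipx ensurepath"
fi
```

You can also just run <code>command -v toil-wdl-runner</code> to see which copy (if any) your shell will find.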
If you see something from <code>pipx</code> like: - cwltoil (symlink missing or pointing to unexpected location) Then <code>pipx uninstall toil</code>, remove the offending file from <code>~/.local/bin</code>, and try again. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pipx</code> command above. ==Configuring your Phoenix Environment== '''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under <code>/private/groups</code>. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under <code>/private/groups</code>. Usually you would end up with <code>/private/groups/YOURGROUPNAME/YOURUSERNAME</code>. Remember this path; we will need it later. ==Configuring Toil for Phoenix== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. However, since these files can be large, and the home directory quota is only 30 GB, you might not be able to keep them in your home directory. We would like to be able to store them on the cluster's large storage array, under <code>/private/groups</code>. However, Toil needs to use file locks in these directories to prevent simultaneous Singularity calls from producing internal Singularity errors, and Ceph currently has [https://tracker.ceph.com/issues/65607 a bug where these file locking operations can freeze the Ceph servers].
If you have '''a small number of container images''' that will fit in your home directory, you can keep them there. [https://github.com/DataBiosphere/toil/commit/cb0b291bb7f6212bfe69221dd9f09d72f83e92fb Since Toil 6.1.0], this is the default behavior and you don't need to do anything. (Unless you previously set <code>SINGULARITY_CACHEDIR</code> or <code>MINIWDL__SINGULARITY__IMAGE_CACHE</code>, in which case you need to unset them.) '''If you don't have room in your home directory''' for container images, currently the recommended approach is to use node-local storage under <code>/data/tmp</code>. This results in each node pulling each container image, but images will be saved across workflows. You can set that up for all your workflows with: echo 'export SINGULARITY_CACHEDIR="/data/tmp/$(whoami)/cache/singularity"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="/data/tmp/$(whoami)/cache/miniwdl"' >>~/.bashrc Then '''log out and log back in again''', to apply the changes. =Running an existing workflow= First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, go to your user directory under <code>/private/groups</code>, and make a directory to work in. cd /private/groups/YOURGROUPNAME/YOURUSERNAME mkdir workflow-test cd workflow-test Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. 
wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ==Preparing an input file== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a <code>File</code> variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code>: echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ==Testing at small scale on a single machine== We are now ready to run the workflow! You don't want to run workflows on the head node.
So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ==Running at larger scale== Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people! Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.
To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the <code>--caching false</code> option. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the <code>TOIL_SLURM_ARGS</code> environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in. mkdir -p logs export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium" toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. =Writing your own workflow= In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ==Writing the file== ===Version=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. 
Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 ===Workflow Block=== Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } ===Input Block=== Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run. ===Body=== Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. ===Scattering=== Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. 
Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } ===Conditionals=== Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } ===Calling Tasks=== Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. 
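Before writing that task, it may help to see the scatter-and-conditionals logic above translated into plain Python (an analogy only, not something Toil runs: the loop stands in for the scatter, <code>None</code> stands in for WDL's <code>null</code>, and a plain <code>str()</code> call stands in for the <code>stringify_number</code> task we haven't written yet):

```python
def select_first(values):
    """Return the first non-None value, like WDL's select_first()."""
    for v in values:
        if v is not None:
            return v
    raise ValueError("all values were null")

def fizzbuzz(item_count, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    results = []
    for i in range(item_count):          # the scatter
        one_based = i + 1
        # Variables declared inside un-executed WDL conditionals are null;
        # here we model that by initializing everything to None.
        fizz = buzz = fizzbuzz_value = number_string = None
        if one_based % to_fizz == 0:
            fizz = "Fizz"
            if one_based % to_buzz == 0:
                fizzbuzz_value = select_first([fizzbuzz_override, "FizzBuzz"])
        if one_based % to_buzz == 0:
            buzz = "Buzz"
        if one_based % to_fizz != 0 and one_based % to_buzz != 0:
            # Just a normal number; stands in for the task call.
            number_string = str(one_based)
        results.append(select_first([fizzbuzz_value, fizz, buzz, number_string]))
    return results
```

Note the ordering in the final <code>select_first()</code>: for multiples of both numbers, <code>fizzbuzz_value</code>, <code>fizz</code>, and <code>buzz</code> are all set, and the combined string wins because it comes first in the list.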
To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, at least for the numbers where we didn't produce a Fizz or Buzz instead. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } ===Writing Tasks=== Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>.
So let's fill that in with <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.
task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } ===Output Block=== Now the only thing missing is a workflow-level <code>output</code> section.
Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. ==Running the Workflow== Now all that remains is to run the workflow!
As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium" toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json =Debugging Workflows= Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong. ==Restarting the Workflow== If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with <code>--jobStore</code> set manually to a directory that persists between attempts, you can add <code>--restart</code> to your workflow command and make Toil try again. It will pick up from where it left off and rerun any failed tasks and then the rest of the workflow. This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart. If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques. ==Debugging Options== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place where you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this.
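If a debug log gets very long, you can pull the per-job sections out programmatically instead of scrolling. A minimal Python sketch, assuming each job log is bracketed by a marker line containing <code>=========></code> and one containing <code><=========</code> (the exact marker text may vary between Toil versions):

```python
def extract_job_logs(log_lines):
    """Collect the per-job log sections from a Toil log.

    Assumes each job log is bracketed by a line containing '=========>'
    and a later line containing '<=========', as in the markers above.
    """
    sections = []
    current = None
    for line in log_lines:
        if "=========>" in line and current is None:
            current = []                 # start a new job log section
        elif "<=========" in line and current is not None:
            sections.append("\n".join(current))
            current = None               # end of this job log section
        elif current is not None:
            current.append(line)
    return sections
```

You could feed this the lines of your main Toil log file and write each section to its own file for easier reading.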
==Reading the Log== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which happens either when the command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error. Go up higher in the log until you find lines that look like: [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows: And [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows: These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. If you would like individual task logs to be saved separately for later reference, you can use the <code>--writeLogs</code> option to specify a directory to store them. For more information, see [https://toil.readthedocs.io/en/latest/wdl/running.html#managing-workflow-logs the Toil documentation of workflow task logs]. ==Reproducing Problems== When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. === Automatically Fetching Input Files === The <code>toil debug-job</code> command has a <code>--retrieveTaskDirectory</code> option that lets you dump out a directory with all the files that a failing WDL task would use.
You can use it like: toil debug-job ./jobstore WDLTaskJob --retrieveTaskDirectory dumpdir If there are multiple failing tasks, you might need to replace <code>WDLTaskJob</code> with the name of one of the failing jobs. See [https://toil.readthedocs.io/en/latest/running/debugging.html#fetching-job-inputs the Toil documentation on retrieving files] for more on how to use this command. === Manually Finding Input Files === If you can't use <code>toil debug-job</code>, you might need to manually dig through the job store for files. In the log of your failing Toil task, look for lines like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk. So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ==More Ways of Finding Files== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name. 
For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ==Using Development Versions of Toil== Sometimes, bugs will be fixed in the development version of Toil, but not released yet. 
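Returning to the <code>toilfile:</code> URI decoding described above: the decode-and-split can also be done in a couple of lines of Python instead of a web tool. A sketch, using the example URI from this section:

```python
from urllib.parse import unquote

def jobstore_relative_path(toilfile_uri: str) -> str:
    """URL-decode a toilfile: URI and return the part after the last
    colon, which is the file's path relative to the job store."""
    return unquote(toilfile_uri).rsplit(":", 1)[1]

# The example URI from the log line above.
uri = ("toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob"
       "%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351"
       "%2FSample.chr14.bam/Sample.chr14.bam")

print(jobstore_relative_path(uri))
```

Joining the printed path onto your <code>--jobStore</code> directory gives the on-disk location of the uploaded file.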
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ==Frequently Asked Questions== ===I am getting warnings about <code>XDG_RUNTIME_DIR</code>=== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> You should upgrade Toil. [https://github.com/DataBiosphere/toil/commit/ff6bf60ab798a675c20156c749817c4313644b96 Since Toil 6.1.0], Toil no longer issues this warning, and just puts up with bad <code>XDG_RUNTIME_DIR</code> settings. ===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!=== The Toil worker process for each job will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. ===How do I delete files in WDL?=== WDL doesn't have a built-in way to delete files; if you run a task that deletes a file, it will still exist in Toil's job store storage. Toil [https://github.com/DataBiosphere/toil/commit/2de6eea2cc2e688b53062a98687445f0cca56669 recently gained support] for deleting files at the ''end'' of WDL workflows. 
So if you have a large file that you only need for part of your workflow, consider writing that part as a separate sub-<code>workflow</code> and invoking it with <code>call</code>. Then the file will be cleaned up when the child workflow ends, leaving more space for files created in the parent workflow. =Additional WDL resources= For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL documentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification] '''Tutorial: Getting Started with WDL Workflows on Phoenix''' Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you. Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order. This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.
=Phoenix Cluster Setup= Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH. ==Getting VPN access== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ==Connecting to Phoenix== Once you have VPN access, you can connect to any of the machines with access to the Phoenix cluster. These interactive nodes are fairly large machines that can do some work locally, but you will still want to run larger workflows on the actual cluster. For this tutorial, we will use <code>emerald.prism</code> as our login node. To connect to the cluster: 1. Connect to the VPN. 2. SSH to <code>emerald.prism</code>. At the command line, run: ssh emerald.prism If your username on the cluster (say, <code>flastname</code>) is different than your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@emerald.prism The first time you connect, you will see a message like: The authenticity of host 'emerald.prism (10.50.1.67)' can't be established. ED25519 key fingerprint is SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI. This key is not known by any other names. Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>emerald.prism</code>, and not an imposter. 
You will want to make sure that the "key fingerprint" is indeed <code>SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is. ==Installing Toil with WDL support== Once you are on the head node, you can install Toil, a program for running workflows. Toil is written in Python, and the modern way to install Python command line tools is with pipx. So [https://pipx.pypa.io/latest/installation/ install pipx]: python3 -m pip install --user pipx python3 -m pipx ensurepath This may instruct you to '''log out and log back in''' or take some other action to adopt the new <code>PATH</code> settings. When installing Toil, you need to specify that you want WDL support. To do this, you can run: pipx install 'toil[wdl]' If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively: pipx install 'toil[wdl,aws,google]' To change what extras are used when you have an existing Toil installation, you will need to use the <code>--force</code> option. This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. The <code>python3 -m pipx ensurepath</code> command should have added the <code>~/.local</code> directory to your <code>PATH</code> environment variable, to ensure you can find these commands.
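To double-check from Python that the <code>PATH</code> change took effect, you can ask where (or whether) the command resolves. A quick sanity check; <code>shutil.which()</code> searches <code>PATH</code> the same way your shell does:

```python
import shutil

# Returns the full path to the executable, or None if it is not on PATH.
path = shutil.which("toil-wdl-runner")
print(path)
```

If this prints <code>None</code>, your shell has not picked up the new <code>PATH</code> yet; log out and back in, or re-run <code>python3 -m pipx ensurepath</code>.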
If you see something from <code>pipx</code> like: - cwltoil (symlink missing or pointing to unexpected location) Then <code>pipx uninstall toil</code>, remove the offending file from <code>~/.local/bin</code>, and try again. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pipx install</code> command above with <code>--force</code>, or run <code>pipx upgrade toil</code>. ==Configuring your Phoenix Environment== '''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under <code>/private/groups</code>. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under <code>/private/groups</code>. Usually you would end up with <code>/private/groups/YOURGROUPNAME/YOURUSERNAME</code>. Remember this path; we will need it later. ==Configuring Toil for Phoenix== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these files can be large, and the home directory quota is only 30 GB, they might not all fit in your home directory. We would like to be able to store these on the cluster's large storage array, under <code>/private/groups</code>. Unfortunately, Toil needs to use file locks in these directories to prevent simultaneous Singularity calls from producing internal Singularity errors, and Ceph currently has [https://tracker.ceph.com/issues/65607 a bug where these file locking operations can freeze the Ceph servers].
If you have '''a small number of container images''' that will fit in your home directory, you can keep them there. [https://github.com/DataBiosphere/toil/commit/cb0b291bb7f6212bfe69221dd9f09d72f83e92fb Since Toil 6.1.0], this is the default behavior and you don't need to do anything. (Unless you previously set <code>SINGULARITY_CACHEDIR</code> or <code>MINIWDL__SINGULARITY__IMAGE_CACHE</code>, in which case you need to unset them.) '''If you don't have room in your home directory''' for container images, currently the recommended approach is to use node-local storage under <code>/data/tmp</code>. This results in each node pulling each container image, but images will be saved across workflows. You can set that up for all your workflows with: echo 'export SINGULARITY_CACHEDIR="/data/tmp/$(whoami)/cache/singularity"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="/data/tmp/$(whoami)/cache/miniwdl"' >>~/.bashrc Then '''log out and log back in again''', to apply the changes. =Running an existing workflow= First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, go to your user directory under <code>/private/groups</code>, and make a directory to work in. cd /private/groups/YOURGROUPNAME/YOURUSERNAME mkdir workflow-test cd workflow-test Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. 
wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ==Preparing an input file== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code>: echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ==Testing at small scale on a single machine== We are now ready to run the workflow! You don't want to run workflows on the head node.
So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use <code>exit</code>. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ==Running at larger scale== Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people! Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.
To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the <code>--caching false</code> option. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the <code>TOIL_SLURM_ARGS</code> environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in. mkdir -p logs export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium" toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. =Writing your own workflow= In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ==Writing the file== ===Version=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. 
Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 ===Workflow Block=== Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } ===Input Block=== Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run. ===Body=== Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. ===Scattering=== Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. 
Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } ===Conditionals=== Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } ===Calling Tasks=== Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. 
To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, at least for the numbers where we didn't make a noise instead. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } ===Writing Tasks=== Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>.
So let's fill that in with <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.
task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } ===Output Block=== Now the only thing missing is a workflow-level <code>output</code> section.
Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:22.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. ==Running the Workflow== Now all that remains is to run the workflow!
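Before running it, it helps to know what output to expect. The workflow's branching logic can be mirrored in a few lines of plain POSIX shell (a sketch for comparison only; the <code>fizzbuzz</code> function name is ours, and the default divisors of 3 and 5 are hardcoded):

```shell
# POSIX shell sketch of the FizzBuzz workflow's logic, for comparison.
# Mirrors the WDL: multiples of 3 say Fizz, of 5 say Buzz, of both say FizzBuzz.
fizzbuzz() {
  item_count=$1
  i=1
  while [ "$i" -le "$item_count" ]; do
    if [ $((i % 3)) -eq 0 ] && [ $((i % 5)) -eq 0 ]; then
      echo "FizzBuzz"
    elif [ $((i % 3)) -eq 0 ]; then
      echo "Fizz"
    elif [ $((i % 5)) -eq 0 ]; then
      echo "Buzz"
    else
      echo "$i"
    fi
    i=$((i + 1))
  done
}

fizzbuzz 20
```

With an <code>item_count</code> of 20, line 15 of this output should read <code>FizzBuzz</code>, matching what the workflow should put at index 14 of <code>FizzBuzz.fizzbuzz_results</code>.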
As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: mkdir -p logs export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium" toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json =Debugging Workflows= Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong. ==Restarting the Workflow== If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with <code>--jobStore</code> set manually to a directory that persists between attempts, you can add <code>--restart</code> to your workflow command and make Toil try again. It will pick up from where it left off and rerun any failed tasks and then the rest of the workflow. This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart. If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques. ==Debugging Options== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place where you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this.
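If you capture the main Toil log to a file, you can pull the job log sections back out with a little text processing. This sketch assumes each job log is bracketed by a pair of identical marker lines like the one shown above; the <code>extract_job_log</code> helper and the log filename are our own, not part of Toil:

```shell
# Print only the lines between pairs of Toil job log marker lines.
# Assumes markers of the form: =========> ... <=========
extract_job_log() {
  awk '/=========> .* <=========/ { capture = !capture; next } capture' "$1"
}

# usage (hypothetical saved log): extract_job_log main.log
```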
==Reading the Log== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen either when the command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error. Go up higher in the log until you find lines that look like: [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows: And [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows: These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. If you would like individual task logs to be saved separately for later reference, you can use the <code>--writeLogs</code> option to specify a directory to store them. For more information, see [https://toil.readthedocs.io/en/latest/wdl/running.html#managing-workflow-logs the Toil documentation of workflow task logs].
You can use it like: toil debug-job ./jobstore WDLTaskJob --retrieveTaskDirectory dumpdir If there are multiple failing tasks, you might need to replace <code>WDLTaskJob</code> with the name of one of the failing jobs. See [https://toil.readthedocs.io/en/latest/running/debugging.html#fetching-job-inputs the Toil documentation on retrieving files] for more on how to use this command. === Manually Finding Input Files === If you can't use <code>toil debug-job</code>, you might need to manually dig through the job store for files. In the log of your failing Toil task, look for lines like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk. So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ==More Ways of Finding Files== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name. 
For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ==Using Development Versions of Toil== Sometimes, bugs will be fixed in the development version of Toil, but not released yet. 
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ==Frequently Asked Questions== ===I am getting warnings about <code>XDG_RUNTIME_DIR</code>=== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> You should upgrade Toil. [https://github.com/DataBiosphere/toil/commit/ff6bf60ab798a675c20156c749817c4313644b96 Since Toil 6.1.0], Toil no longer issues this warning, and just puts up with bad <code>XDG_RUNTIME_DIR</code> settings. ===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!=== The Toil worker process for each job will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. ===How do I delete files in WDL?=== WDL doesn't have a built-in way to delete files; if you run a task that deletes a file, it will still exist in Toil's job store storage. Toil [https://github.com/DataBiosphere/toil/commit/2de6eea2cc2e688b53062a98687445f0cca56669 recently gained support] for deleting files at the ''end'' of WDL workflows. 
So if you have a large file that you only need for part of your workflow, consider writing the part that creates and uses it as a separate sub-<code>workflow</code> and invoking it with <code>call</code>. Then the file will be cleaned up when the child workflow ends, leaving more space for files created in the parent workflow. =Additional WDL resources= For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL documentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]
=Phoenix Cluster Setup= Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH. ==Getting VPN access== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ==Connecting to Phoenix== Once you have VPN access, you can connect to any of the machines with access to the Phoenix cluster. These interactive nodes are fairly large machines that can do some work locally, but you will still want to run larger workflows on the actual cluster. For this tutorial, we will use <code>emerald.prism</code> as our login node. To connect to the cluster: 1. Connect to the VPN. 2. SSH to <code>emerald.prism</code>. At the command line, run: ssh emerald.prism If your username on the cluster (say, <code>flastname</code>) is different than your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@emerald.prism The first time you connect, you will see a message like: The authenticity of host 'emerald.prism (10.50.1.67)' can't be established. ED25519 key fingerprint is SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI. This key is not known by any other names. Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>emerald.prism</code>, and not an imposter. 
You will want to make sure that the "key fingerprint" is indeed <code>SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is.

==Installing Toil with WDL support==

Once you are on the head node, you can install Toil, a program for running workflows. Toil is written in Python, and the modern way to install Python command line tools is with pipx. So [https://pipx.pypa.io/latest/installation/ install pipx]:

 python3 -m pip install --user pipx
 python3 -m pipx ensurepath

This may instruct you to '''log out and log back in''' or take some other action to adopt the new <code>PATH</code> settings.

When installing Toil, you need to specify that you want WDL support. To do this, you can run:

 pipx install 'toil[wdl]'

If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will also need to install Toil with the <code>aws</code> and <code>google</code> extras, respectively:

 pipx install 'toil[wdl,aws,google]'

To change what extras are used when you have an existing Toil installation, you will need to use the <code>--force</code> option.

This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. The <code>python3 -m pipx ensurepath</code> command should have added the <code>~/.local/bin</code> directory to your <code>PATH</code> environment variable, to ensure you can find these commands.
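As a quick sanity check, you can confirm from Python that the pipx install directory actually made it onto your <code>PATH</code>. This is a purely illustrative sketch (the helper function here is not part of pipx or Toil):

```python
# Illustrative check: is pipx's install directory, ~/.local/bin, on PATH?
import os

def dir_on_path(directory, path_value):
    """Return True if `directory` appears as an entry in a PATH-style string."""
    return directory in path_value.split(os.pathsep)

local_bin = os.path.expanduser("~/.local/bin")
print("~/.local/bin on PATH:", dir_on_path(local_bin, os.environ.get("PATH", "")))
```

If this prints <code>False</code>, re-run <code>python3 -m pipx ensurepath</code> and log out and back in.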
If you see something from <code>pipx</code> like:

 - cwltoil (symlink missing or pointing to unexpected location)

Then <code>pipx uninstall toil</code>, remove the offending file from <code>~/.local/bin</code>, and try again.

To make sure it worked, you can run:

 toil-wdl-runner --help

If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports.

If you ever want to upgrade Toil to a new release, you can repeat the <code>pipx install</code> command above with <code>--force</code>, or run <code>pipx upgrade toil</code>.

==Configuring your Phoenix Environment==

'''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under <code>/private/groups</code>.

You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under <code>/private/groups</code>. Usually you would end up with <code>/private/groups/YOURGROUPNAME/YOURUSERNAME</code>. Remember this path; we will need it later.

==Configuring Toil for Phoenix==

Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. On the Phoenix cluster, however, we do have a shared filesystem, so we should configure Toil to use it for caching the Docker container images used for running workflow steps.

Since these image files can be large, and the home directory quota is only 30 GB, you might not be able to keep them in your home directory. We would like to be able to store them on the cluster's large storage array, under <code>/private/groups</code>. However, Toil needs to use file locks in these directories to prevent simultaneous Singularity calls from producing internal Singularity errors, and Ceph currently has [https://tracker.ceph.com/issues/65607 a bug where these file locking operations can freeze the Ceph servers].
If you have '''a small number of container images''' that will fit in your home directory, you can keep them there. [https://github.com/DataBiosphere/toil/commit/cb0b291bb7f6212bfe69221dd9f09d72f83e92fb Since Toil 6.1.0], this is the default behavior and you don't need to do anything. (Unless you previously set <code>SINGULARITY_CACHEDIR</code> or <code>MINIWDL__SINGULARITY__IMAGE_CACHE</code>, in which case you need to unset them.) '''If you don't have room in your home directory''' for container images, currently the recommended approach is to use node-local storage under <code>/data/tmp</code>. This results in each node pulling each container image, but images will be saved across workflows. You can set that up for all your workflows with: echo 'export SINGULARITY_CACHEDIR="/data/tmp/$(whoami)/cache/singularity"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="/data/tmp/$(whoami)/cache/miniwdl"' >>~/.bashrc Then '''log out and log back in again''', to apply the changes. =Running an existing workflow= First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, go to your user directory under <code>/private/groups</code>, and make a directory to work in. cd /private/groups/YOURGROUPNAME/YOURUSERNAME mkdir workflow-test cd workflow-test Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. 
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl

==Preparing an input file==

Near the top of the WDL file, there's a section like this:

 workflow hello_caller {
     input {
         File who
     }

This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.

So first, we have to make that list of names. Let's make it in <code>names.txt</code>:

 echo "Mridula Resurrección" >names.txt
 echo "Gershom Šarlota" >>names.txt
 echo "Ritchie Ravi" >>names.txt

Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this:

 echo '{"hello_caller.who": "./names.txt"}' >inputs.json

Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].

==Testing at small scale on a single machine==

We are now ready to run the workflow! But you don't want to run workflows on the head node.
So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:

 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i

This will start a new shell that can run for 2 hours. In your new shell, run this Toil command:

 toil-wdl-runner self_test.wdl inputs.json -o local_run

This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print:

 {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]}

The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person.

To leave your interactive Slurm session and return to the head node, use <code>exit</code>.

==Running at larger scale==

Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people! Go get this handy list of people and cut it to length:

 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt
 head -n100 1000_names.txt >100_names.txt

And make a new inputs file:

 echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json

Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.
To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil at a shared directory (which it will create) where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the <code>--caching false</code> option. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it.

Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the <code>TOIL_SLURM_ARGS</code> environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.

 mkdir -p logs
 export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium"
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json

This will tick along for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory.

=Writing your own workflow=

In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz].

==Writing the file==

===Version===

All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow.
Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 ===Workflow Block=== Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } ===Input Block=== Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run. ===Body=== Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. ===Scattering=== Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. 
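To make the <code>map()</code> comparison concrete, here is a short, purely illustrative Python sketch of what the scatter we're about to write computes. (WDL runs the scatter body in parallel; the list comprehension below is just the sequential analogy, and the <code>item_count</code> value is made up for the example.)

```python
# Sequential Python analogy (illustrative only) of the WDL scatter:
# WDL's range() makes [0, 1, ..., n-1], and the scatter body runs once
# per element, here incrementing each value so the numbering starts at 1.
item_count = 5
numbers = list(range(item_count))     # like: Array[Int] numbers = range(item_count)
one_based = [i + 1 for i in numbers]  # like: scatter (i in numbers) { Int one_based = i + 1 }
print(one_based)  # [1, 2, 3, 4, 5]
```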
Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } ===Conditionals=== Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } ===Calling Tasks=== Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. 
To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access.

 Array[Int] numbers = range(item_count)
 scatter (i in numbers) {
     Int one_based = i + 1
     if (one_based % to_fizz == 0) {
         String fizz = "Fizz"
         if (one_based % to_buzz == 0) {
             String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
         }
     }
     if (one_based % to_buzz == 0) {
         String buzz = "Buzz"
     }
     if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
         # Just a normal number.
         call stringify_number {
             input:
                 the_number = one_based
         }
     }
     String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
 }

We can put the code into the workflow now, and set about writing the task.

 version 1.0
 
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }

===Writing Tasks===

Our task should go after the workflow in the file. It looks a lot like a workflow, except it uses <code>task</code>.

 task stringify_number {
 }

We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>.
So let's fill that in with <code>input</code> and <code>output</code> sections.

 task stringify_number {
     input {
         Int the_number
     }
     # ???
     output {
         String the_string # = ???
     }
 }

Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code> (that is, Bash-like substitution, but with a tilde) to place WDL variables into your command script. So let's add a command that will echo back the number, so we can see it as a string.

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string # = ???
     }
 }

Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).

 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
 }

We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.
 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage.

Then we can put our task into our WDL file:

 version 1.0
 
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
 }
 
 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

===Output Block===

Now the only thing missing is a workflow-level <code>output</code> section.
Technically, in WDL 1.0 you aren't supposed to need this, but it is required in WDL 1.1, and Toil doesn't actually deliver your outputs anywhere yet if you don't have one. So we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> block, above the task.

 version 1.0
 
 workflow FizzBuzz {
     input {
         # How many FizzBuzz numbers do we want to make?
         Int item_count
         # Every multiple of this number, we produce "Fizz"
         Int to_fizz = 3
         # Every multiple of this number, we produce "Buzz"
         Int to_buzz = 5
         # Optional replacement for the string to print when a multiple of both
         String? fizzbuzz_override
     }
     Array[Int] numbers = range(item_count)
     scatter (i in numbers) {
         Int one_based = i + 1
         if (one_based % to_fizz == 0) {
             String fizz = "Fizz"
             if (one_based % to_buzz == 0) {
                 String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
             }
         }
         if (one_based % to_buzz == 0) {
             String buzz = "Buzz"
         }
         if (one_based % to_fizz != 0 && one_based % to_buzz != 0) {
             # Just a normal number.
             call stringify_number {
                 input:
                     the_number = one_based
             }
         }
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])
     }
     output {
         Array[String] fizzbuzz_results = result
     }
 }
 
 task stringify_number {
     input {
         Int the_number
     }
     command <<<
         # This is a Bash script.
         # So we should do good Bash script things like stop on errors
         set -e
         # Now print our number as a string
         echo ~{the_number}
     >>>
     output {
         String the_string = read_string(stdout())
     }
     runtime {
         cpu: 1
         memory: "0.5 GB"
         disks: "local-disk 1 SSD"
         docker: "ubuntu:22.04"
     }
 }

Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as an array.

==Running the Workflow==

Now all that remains is to run the workflow!
As before, make an inputs file to specify the workflow inputs:

 echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json

Then run it on the cluster with Toil:

 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --slurmTime 00:10:00 --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

Or locally:

 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json

=Debugging Workflows=

Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.

==Restarting the Workflow==

If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with <code>--jobStore</code> set manually to a directory that persists between attempts, you can add <code>--restart</code> to your workflow command and make Toil try again. It will pick up from where it left off, rerun any failed tasks, and then run the rest of the workflow.

This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart.

If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques.

==Debugging Options==

When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code> so that the stored files shipped between jobs are in a place where you can access them.

When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:

 =========>
 Toil job log is here
 <=========

Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this.
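If you save a big <code>--logDebug</code> run to a file, you can pull the per-job sections back out with a few lines of Python. This is a hypothetical helper, not part of Toil, and it assumes the marker text is exactly as shown above; adjust the pattern if your log looks different:

```python
# Hypothetical helper: extract per-job log sections from a saved Toil log.
# Assumes job logs appear between '=========>' and '<=========' markers.
import re

JOB_LOG = re.compile(r"=========>(.*?)<=========", re.DOTALL)

def extract_job_logs(log_text):
    """Return a list of the job log sections found in the full log text."""
    return [section.strip() for section in JOB_LOG.findall(log_text)]

sample = "worker noise\n=========>\njob log line 1\njob log line 2\n<=========\nmore noise"
print(extract_job_logs(sample))  # ['job log line 1\njob log line 2']
```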
==Reading the Log==

When a WDL workflow fails, you are likely to see a message like this:

 WDL.runtime.error.CommandFailed: task command failed with exit status 1
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism

This means that the command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code. That happens either when the command is written wrong, or when the error detection code in the tool you are trying to run detects and reports a problem.

Go up higher in the log until you find lines that look like:

 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows:

And:

 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows:

These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.

If you would like individual task logs to be saved separately for later reference, you can use the <code>--writeLogs</code> option to specify a directory to store them in. For more information, see [https://toil.readthedocs.io/en/latest/wdl/running.html#managing-workflow-logs the Toil documentation of workflow task logs].

==Reproducing Problems==

When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this.

===Automatically Fetching Input Files===

The <code>toil debug-job</code> command has a <code>--retrieveTaskDirectory</code> option that lets you dump out a directory with all the files that a failing WDL task would use.
You can use it like: toil debug-job ./jobstore WDLTaskJob --retrieveTaskDirectory dumpdir If there are multiple failing tasks, you might need to replace <code>WDLTaskJob</code> with the name of one of the failing jobs. See [https://toil.readthedocs.io/en/latest/running/debugging.html#fetching-job-inputs the Toil documentation on retrieving files] for more on how to use this command. === Manually Finding Input Files === If you can't use <code>toil debug-job</code>, you might need to manually dig through the job store for files. In the log of your failing Toil task, look for lines like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk. So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ==More Ways of Finding Files== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name. 
For example, if you want to look at <code>Sample.bam</code>, you can look for it like this:

 find /path/to/the/jobstore -name "Sample.bam"

If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:

 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam

You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/ urldecoder.io], getting this:

 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam

Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path, relative to the job store, where this file can be found.

==Using Development Versions of Toil==

Sometimes, bugs will be fixed in the development version of Toil, but not released yet.
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ==Frequently Asked Questions== ===I am getting warnings about <code>XDG_RUNTIME_DIR</code>=== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> You should upgrade Toil. [https://github.com/DataBiosphere/toil/commit/ff6bf60ab798a675c20156c749817c4313644b96 Since Toil 6.1.0], Toil no longer issues this warning, and just puts up with bad <code>XDG_RUNTIME_DIR</code> settings. ===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!=== The Toil worker process for each job will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. ===How do I delete files in WDL?=== WDL doesn't have a built-in way to delete files; if you run a task that deletes a file, it will still exist in Toil's job store storage. Toil [https://github.com/DataBiosphere/toil/commit/2de6eea2cc2e688b53062a98687445f0cca56669 recently gained support] for deleting files at the ''end'' of WDL workflows. 
So if you have a large file that you only need for part of your workflow, consider writing the part that creates and uses it as a separate sub-<code>workflow</code> and invoking it with <code>call</code>. Then the file will be cleaned up when the child workflow ends, leaving more space for files created in the parent workflow.

=Additional WDL resources=

For more information on writing and running WDL workflows, see:

* [https://docs.openwdl.org/en/stable/ The WDL documentation]
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube]
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]
=Phoenix Cluster Setup= Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH. ==Getting VPN access== We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall. To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster. ==Connecting to Phoenix== Once you have VPN access, you can connect to any of the machines with access to the Phoenix cluster. These interactive nodes are fairly large machines that can do some work locally, but you will still want to run larger workflows on the actual cluster. For this tutorial, we will use <code>emerald.prism</code> as our login node. To connect to the cluster: 1. Connect to the VPN. 2. SSH to <code>emerald.prism</code>. At the command line, run: ssh emerald.prism If your username on the cluster (say, <code>flastname</code>) is different than your username on your computer (which might be <code>firstname</code>), you might instead have to run: ssh flastname@emerald.prism The first time you connect, you will see a message like: The authenticity of host 'emerald.prism (10.50.1.67)' can't be established. ED25519 key fingerprint is SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI. This key is not known by any other names. Are you sure you want to continue connecting (yes/no/[fingerprint])? This is your computer asking you to help it decide if it is talking to the genuine <code>emerald.prism</code>, and not an imposter. 
You will want to make sure that the "key fingerprint" is indeed <code>SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI</code>. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type <code>yes</code> to accept and remember that the server is who it says it is. ==Installing Toil with WDL support== Once you are on the head node, you can install Toil, a program for running workflows. Toil is written in Python, and the modern way to install Python command line tools is with pipx. So [https://pipx.pypa.io/latest/installation/ install pipx]: python3 -m pip install --user pipx python3 -m pipx ensurepath This may instruct you to '''log out and log back in''' or take some other action to adopt the new <code>PATH</code> settings. When installing Toil, you need to specify that you want WDL support. To do this, you can run: pipx install 'toil[wdl]' If you also want to use AWS S3 <code>s3://</code> and/or Google <code>gs://</code> URLs for data, you will need to also install Toil with the <code>aws</code> and <code>google</code> extras, respectively: pipx install 'toil[wdl,aws,google]' To change what extras are used when you have an existing Toil installation, you will need to use the <code>--force</code> option. This will install Toil in the <code>.local</code> directory inside your home directory, which we write as <code>~/.local</code>. The program to run WDL workflows, <code>toil-wdl-runner</code>, will be at <code>~/.local/bin/toil-wdl-runner</code>. The <code>python3 -m pipx ensurepath</code> command should have added the <code>~/.local/bin</code> directory to your <code>PATH</code> environment variable, to ensure you can find these commands. 
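You can check whether the <code>ensurepath</code> step took effect with a quick POSIX-shell sketch (the check itself is an illustration, not a pipx feature):

```shell
# Check whether pipx's install directory is on the PATH,
# which is what `python3 -m pipx ensurepath` is meant to arrange.
case ":$PATH:" in
  *":$HOME/.local/bin:"*) echo "~/.local/bin is on your PATH" ;;
  *) echo "~/.local/bin is missing; re-run: python3 -m pipx ensurepath" ;;
esac
```

If the check fails even after running <code>ensurepath</code>, remember you may need to log out and back in first. And to change which extras an existing installation has, reinstall over it, for example with <code>pipx install --force 'toil[wdl,aws,google]'</code>.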
If you see something from <code>pipx</code> like: - cwltoil (symlink missing or pointing to unexpected location) Then <code>pipx uninstall toil</code>, remove the offending file from <code>~/.local/bin</code>, and try again. To make sure it worked, you can run: toil-wdl-runner --help If everything worked correctly, it will print a long list of the various option flags that the <code>toil-wdl-runner</code> command supports. If you ever want to upgrade Toil to a new release, you can repeat the <code>pipx install</code> command above with <code>--force</code>, or run <code>pipx upgrade toil</code>. ==Configuring your Phoenix Environment== '''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under <code>/private/groups</code>. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under <code>/private/groups</code>. Usually you would end up with <code>/private/groups/YOURGROUPNAME/YOURUSERNAME</code>. Remember this path; we will need it later. ==Configuring Toil for Phoenix== Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. On the Phoenix cluster, we do have a shared filesystem, so we should configure Toil to use it for caching the Docker container images used for running workflow steps. However, since these files can be large, and the home directory quota is only 30 GB, you might not be able to keep them in your home directory. We would like to store them on the cluster's large storage array, under <code>/private/groups</code>, but Toil needs to use file locks in these directories to prevent simultaneous Singularity calls from producing internal Singularity errors, and Ceph currently has [https://tracker.ceph.com/issues/65607 a bug where these file locking operations can freeze the Ceph servers]. 
If you have '''a small number of container images''' that will fit in your home directory, you can keep them there. [https://github.com/DataBiosphere/toil/commit/cb0b291bb7f6212bfe69221dd9f09d72f83e92fb Since Toil 6.1.0], this is the default behavior and you don't need to do anything. (Unless you previously set <code>SINGULARITY_CACHEDIR</code> or <code>MINIWDL__SINGULARITY__IMAGE_CACHE</code>, in which case you need to unset them.) '''If you don't have room in your home directory''' for container images, currently the recommended approach is to use node-local storage under <code>/data/tmp</code>. This results in each node pulling each container image, but images will be saved across workflows. You can set that up for all your workflows with: echo 'export SINGULARITY_CACHEDIR="/data/tmp/$(whoami)/cache/singularity"' >>~/.bashrc echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="/data/tmp/$(whoami)/cache/miniwdl"' >>~/.bashrc Then '''log out and log back in again''', to apply the changes. =Running an existing workflow= First, let's use <code>toil-wdl-runner</code> to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project]. First, go to your user directory under <code>/private/groups</code>, and make a directory to work in. cd /private/groups/YOURGROUPNAME/YOURUSERNAME mkdir workflow-test cd workflow-test Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally. 
wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl ==Preparing an input file== Near the top of the WDL file, there's a section like this: workflow hello_caller { input { File who } This means that there is a workflow named <code>hello_caller</code> in this file, and it takes as input a file variable named <code>who</code>. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one. So first, we have to make that list of names. Let's make it in <code>names.txt</code>: echo "Mridula Resurrección" >names.txt echo "Gershom Šarlota" >>names.txt echo "Ritchie Ravi" >>names.txt Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to <code>names.txt</code> that references it by relative path, like this: echo '{"hello_caller.who": "./names.txt"}' >inputs.json Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification]. ==Testing at small scale on a single machine== We are now ready to run the workflow! You don't want to run workflows on the head node, though. 
So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running: srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i This will start a new shell that can run for up to 2 hours. In your new shell, run this Toil command: toil-wdl-runner self_test.wdl inputs.json -o local_run This will, by default, use the <code>single_machine</code> Toil "batch system" to run all of the workflow's tasks locally. Output will be sent to a new directory named <code>local_run</code>. This will print a lot of logging to standard error, and to standard output it will print: {"hello_caller.message_files": ["local_run/Mridula Resurrecci\u00f3n.txt", "local_run/Gershom \u0160arlota.txt", "local_run/Ritchie Ravi.txt"], "hello_caller.messages": ["Hello, Mridula Resurrecci\u00f3n!", "Hello, Gershom \u0160arlota!", "Hello, Ritchie Ravi!"]} The <code>local_run</code> directory will contain the described text files (with Unicode escape sequences like <code>\u00f3</code> replaced by their corresponding characters), each containing a greeting for the corresponding person. To leave your interactive Slurm session and return to the head node, use <code>exit</code>. ==Running at larger scale== Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people! Go get this handy list of people and cut it to length: wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt head -n100 1000_names.txt >100_names.txt And make a new inputs file: echo '{"hello_caller.who": "./100_names.txt"}' >inputs_big.json Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster. 
To run against the Slurm cluster, we need to use the <code>--jobStore</code> option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the <code>--caching false</code> option. We will add the <code>--batchLogsDir</code> option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the <code>-m</code> option to save the output JSON to a file instead of printing it. Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the <code>TOIL_SLURM_ARGS</code> environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in. mkdir -p logs export TOIL_SLURM_ARGS="--time=02:00:00 --partition=medium" toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json This will tick for a while, but eventually you should end up with 100 greeting files in the <code>slurm_run</code> directory. =Writing your own workflow= In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. ==Writing the file== ===Version=== All WDL files need to start with a <code>version</code> statement (unless they are very old <code>draft-2</code> files). Toil supports <code>draft-2</code>, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only <code>draft-2</code> and 1.0. So let's start a new WDL 1.0 workflow. 
Open up a file named <code>fizzbuzz.wdl</code> and start with a version statement: version 1.0 ===Workflow Block=== Then, add an empty <code>workflow</code> named <code>FizzBuzz</code>. version 1.0 workflow FizzBuzz { } ===Input Block=== Workflows usually need some kind of user input, so let's give our workflow an <code>input</code> section. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } } Notice that each input has a type, a name, and an optional default value. If the type ends in <code>?</code>, the value is optional, and it may be <code>null</code>. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run. ===Body=== Now we'll start on the body of the workflow, to be inserted just after the inputs section. The first thing we're going to need to do is create an array of all the numbers up to the <code>item_count</code>. We can do this by calling the WDL <code>range()</code> function, and assigning the result to an <code>Array[Int]</code> variable. Array[Int] numbers = range(item_count) WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more. ===Scattering=== Once we create an array of all the numbers, we can use a <code>scatter</code> to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a <code>map()</code> in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL <code>range()</code> starts at 0. 
Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 } ===Conditionals=== Inside the body of the scatter, we are going to put some conditionals to determine if we should produce <code>"Fizz"</code>, <code>"Buzz"</code>, or <code>"FizzBuzz"</code>. To support our <code>fizzbuzz_override</code>, we use an array of it and a default value, and use the WDL <code>select_first()</code> function to find the first non-null value in that array. Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use <code>select_first()</code> at the end and take advantage of variables from un-executed conditionals being <code>null</code>. Note that WDL supports conditional ''expressions'' with a <code>then</code> and an <code>else</code>, but conditional ''statements'' only have a body, not an <code>else</code> branch. If you need an else you will have to check the negated condition. So first, let's handle the special cases. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. } } ===Calling Tasks=== Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a <code>${}</code> substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a <code>stringify_number</code> task, to be written later. 
To call a task (or another workflow), we use a <code>call</code> statement and give it some inputs. Then we can fish the output values out of the task with <code>.</code> access, but only in the branch where we don't make a noise instead. Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } We can put the code into the workflow now, and set about writing the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } ===Writing Tasks=== Our task should go after the workflow in the file. It looks a lot like a workflow except it uses <code>task</code>. task stringify_number { } We're going to want it to take in an integer <code>the_number</code>, and we're going to want it to output a string <code>the_string</code>. 
So let's fill in <code>input</code> and <code>output</code> sections. task stringify_number { input { Int the_number } # ??? output { String the_string # = ??? } } Now, unlike workflows, tasks can have a <code>command</code> section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use <code>~{}</code>, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string. task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string # = ??? } } Now we need to capture the result of the command script. The WDL <code>stdout()</code> function returns a WDL <code>File</code> containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL <code>read_string()</code> function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]). task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } } We're also going to want to add a <code>runtime</code> section to our task, to specify resource requirements. We're also going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate <code>echo</code> command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them. 
task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:24.04" } } The <code>disks</code> section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a <code>local-disk</code> of a certain number of gigabytes, which may suggest that it be <code>SSD</code> storage. Then we can put our task into our WDL file: version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:24.04" } } ===Output Block=== Now the only thing missing is a workflow-level <code>output</code> section. 
Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1, and Toil doesn't actually deliver your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an <code>Array[String]</code>. We'll add the <code>output</code> section at the end of the <code>workflow</code> section, above the task. version 1.0 workflow FizzBuzz { input { # How many FizzBuzz numbers do we want to make? Int item_count # Every multiple of this number, we produce "Fizz" Int to_fizz = 3 # Every multiple of this number, we produce "Buzz" Int to_buzz = 5 # Optional replacement for the string to print when a multiple of both String? fizzbuzz_override } Array[Int] numbers = range(item_count) scatter (i in numbers) { Int one_based = i + 1 if (one_based % to_fizz == 0) { String fizz = "Fizz" if (one_based % to_buzz == 0) { String fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"]) } } if (one_based % to_buzz == 0) { String buzz = "Buzz" } if (one_based % to_fizz != 0 && one_based % to_buzz != 0) { # Just a normal number. call stringify_number { input: the_number = one_based } } String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string]) } output { Array[String] fizzbuzz_results = result } } task stringify_number { input { Int the_number } command <<< # This is a Bash script. # So we should do good Bash script things like stop on errors set -e # Now print our number as a string echo ~{the_number} >>> output { String the_string = read_string(stdout()) } runtime { cpu: 1 memory: "0.5 GB" disks: "local-disk 1 SSD" docker: "ubuntu:24.04" } } Because the <code>result</code> variable is defined inside a <code>scatter</code>, when we reference it outside the scatter we see it as being an array. ==Running the Workflow== Now all that remains is to run the workflow! 
As before, make an inputs file to specify the workflow inputs: echo '{"FizzBuzz.item_count": 20}' >fizzbuzz.json Then run it on the cluster with Toil: toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --slurmTime 00:10:00 --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json Or locally: toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json =Debugging Workflows= Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong. ==Restarting the Workflow== If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with <code>--jobStore</code> set manually to a directory that persists between attempts, you can add <code>--restart</code> to your workflow command and make Toil try again. It will pick up from where it left off, rerun any failed tasks, and then run the rest of the workflow. This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart. If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques. ==Debugging Options== When debugging a workflow, make sure to run the workflow with <code>--logDebug</code>, to set the log level to <code>DEBUG</code>, and with <code>--jobStore /some/path/to/a/shared/directory/it/can/create</code>, so that the stored files shipped between jobs are in a place where you can access them. When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers: =========> Toil job log is here <========= Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this. 
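If you have saved the main log to a file, one way to pull an individual job's section back out is to print everything between those marker lines. A minimal <code>sed</code> sketch, assuming the markers appear on their own lines as shown above (the log file here is fabricated just so the example is self-contained; use your real saved log instead):

```shell
# Fabricate a tiny example log so the sed command below has something to chew on.
cat > example.log <<'EOF'
[2024-01-16T20:12:19-0500] [MainThread] [I] [toil.statsAndLogging] other logging
=========>
        Toil job log is here
<=========
[2024-01-16T20:12:19-0500] [MainThread] [I] [toil.statsAndLogging] logging after the job log
EOF

# Print everything between the opening and closing markers, inclusive.
sed -n '/=========>/,/<=========/p' example.log
```

For long logs with many jobs, this prints every marked section; pipe through <code>less</code> or narrow the range patterns as needed.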
==Reading the Log== When a WDL workflow fails, you are likely to see a message like this: WDL.runtime.error.CommandFailed: task command failed with exit status 1 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, either because the command itself is written wrong, or because the error detection code in the tool you are trying to run detected and reported an error. Go up higher in the log until you find lines that look like: [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows: And: [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows: These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there. If you would like individual task logs to be saved separately for later reference, you can use the <code>--writeLogs</code> option to specify a directory to store them in. For more information, see [https://toil.readthedocs.io/en/latest/wdl/running.html#managing-workflow-logs the Toil documentation of workflow task logs]. ==Reproducing Problems== When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. === Automatically Fetching Input Files === The <code>toil debug-job</code> command has a <code>--retrieveTaskDirectory</code> option that lets you dump out a directory with all the files that a failing WDL task would use. 
You can use it like: toil debug-job ./jobstore WDLTaskJob --retrieveTaskDirectory dumpdir If there are multiple failing tasks, you might need to replace <code>WDLTaskJob</code> with the name of one of the failing jobs. See [https://toil.readthedocs.io/en/latest/running/debugging.html#fetching-job-inputs the Toil documentation on retrieving files] for more on how to use this command. === Manually Finding Input Files === If you can't use <code>toil debug-job</code>, you might need to manually dig through the job store for files. In the log of your failing Toil task, look for lines like this: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files: [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh' [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam' ... The <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam</code> part is a Toil file ID, and it is a relative path from your <code>--jobStore</code> value to where the file is stored on disk. So if you ran the workflow with <code>--jobStore /private/groups/patenlab/anovak/jobstore</code>, you would look for this file at: /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam ==More Ways of Finding Files== Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the <code>find</code> command to try and find the files by name. 
For example, if you want to look at <code>Sample.bam</code>, you can look for it like this: find /path/to/the/jobstore -name "Sample.bam" If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log: [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam You can take the <code>toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam</code> URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this: toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam Then you can take the part after the last colon, <code>files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam</code>, and that is the path relative to the job store where this file can be found. ==Using Development Versions of Toil== Sometimes, bugs will be fixed in the development version of Toil, but not released yet. 
To try the current development version of Toil, you can install it like this: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]' If you want to use a particular branch or commit, like <code>aaa451b320fc115b3563ced25cb501301cf86f90</code>, you can do: pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]' ==Frequently Asked Questions== ===I am getting warnings about <code>XDG_RUNTIME_DIR</code>=== You may be seeing warnings that <code>XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!</code> You should upgrade Toil. [https://github.com/DataBiosphere/toil/commit/ff6bf60ab798a675c20156c749817c4313644b96 Since Toil 6.1.0], Toil no longer issues this warning, and just puts up with bad <code>XDG_RUNTIME_DIR</code> settings. ===Toil said it was <code>Redirecting logging</code> somewhere, but I can't find that file!=== The Toil worker process for each job will say that it is <code>Redirecting logging to /data/tmp/somewhere/worker_log.txt</code>, and when running in single machine mode these messages go to the main Toil log. The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the <code>--logDebug</code> option to Toil. If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node. ===How do I delete files in WDL?=== WDL doesn't have a built-in way to delete files; if you run a task that deletes a file, it will still exist in Toil's job store storage. Toil [https://github.com/DataBiosphere/toil/commit/2de6eea2cc2e688b53062a98687445f0cca56669 recently gained support] for deleting files at the ''end'' of WDL workflows. 
So if you have a large file that you only need for part of your workflow, consider writing the part that creates and uses it as a separate sub-<code>workflow</code> and invoking it with <code>call</code>. Then the file will be cleaned up when the child workflow ends, leaving more space for files created in the parent workflow. =Additional WDL resources= For more information on writing and running WDL workflows, see: * [https://docs.openwdl.org/en/stable/ The WDL documentation] * [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The "Learn WDL" video course on YouTube] * [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]